1
Automatic FMD News Monitoring and Classification
  • Hsinchun Chen, Yulei Zhang, Cathy Larson
  • The BioPortal Team
  • Artificial Intelligence Lab
  • University of Arizona
  • Acknowledgements: NSF ITR Program, DHS/UADA, UC
    Davis FMD Lab

2
Introduction
  • Global FMD impact: UK, 30B, 2002.
  • The FMD Lab manually collects FMD news on the Web
    to monitor the status of the FMD around the
    world.

3
An Example of Outbreak News from Cattlenetwork

4
An Example of Outbreak Report from OIE
5
Data Sources for Syndromic Surveillance (cont'd)
  • Different Types of Data Sources for Syndromic
    Surveillance (Yan et al., 2008)

In this study, we focus on public sources for FMD
surveillance: news reports and bulletin
notifications.
6
News-based Surveillance Systems (cont'd)
  • DIB/ARGUS monitors news sources and identifies
    social disruption indicators using a Bayesian
    model.
  • ProMED-mail is a manually based system. Reports
    are often contributed by ProMED-mail subscribers
    and reviewed by a team of experts.
  • MiTAP monitors many diseases and categorizes the
    information at the disease level. It does not
    focus on FMD and it does not classify news into
    categories.
  • In this study, we aim to build an automatic
    news-based surveillance and classification system
    for FMD.

7
Research Questions
  • This study is aimed at building an automatic FMD
    news monitoring and classification system. We
    examine different machine learning techniques and
    different feature sets for on-line FMD news
    classification and develop the following research
    questions.
  • Q1. How can we monitor FMD news websites and
    bulletin notifications?
  • Q2. Can the combination of Bag of Words, Noun
    Phrases, and Named Entities features outperform
    the Bag of Words features alone in FMD news
    classification?
  • Q3. Can the feature selection method help improve
    the performance of FMD news classification?
  • Q4. Which machine learning method performs better
    for FMD news classification?

8
Framework of On-line News Monitoring and
Classification
  • Data Collection
    • Web Spidering: use spidering algorithms to
      gather web pages from on-line news sources.
    • News Filtering: filter out unrelated news.
  • Document Representation and Feature Selection
    • Document Representation: transform documents
      from full text version to document vectors.
    • Word Filtering: filter out noisy words from
      the document vectors.
    • Stemming and Stopwording: do word stemming and
      remove stop words.
    • Feature Evaluation and Generation: search the
      feature space and evaluate different features.
    • Feature Selection: select the features with
      high evaluation scores.
  • Classification and Evaluation

9
System Design
10
Data Collection
Important FMD News On-line Sources Identified by
the FMD Lab (UC Davis)
We have developed spider programs to monitor
these important data sources.
11
Manual Collection of FMD News (cont'd)
  • Top 10 Websites in the Manual Collection

12
News Spidering
  • Of the 31 important FMD news on-line sources
    identified by the FMD Lab (UC Davis), we monitor
    29 (Yahoo and Google are not included).
  • We have gathered more than 180,000 news files
    from the 29 on-line sources.
  • For the websites we identified from the FMD news
    manual collection, we monitor the top 101
    websites (each has more than 5 pieces of news in
    the collection).
  • We have gathered more than 650,000 news files
    from the 101 websites.
  • The 29 FMD News on-line sources are among the 101
    websites we identified.
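The slides do not show the spider programs themselves. As a minimal sketch of one step such a spider needs, the following extracts outgoing links from a downloaded page using only the Python standard library. The class name, function name, sample page and URLs are all hypothetical, and downloading, scheduling and deduplication are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href=...> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the given HTML text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real spider would download it first.
sample_html = '<a href="/news/1.html">Story</a> <a href="http://example.org/x">X</a>'
links = extract_links(sample_html, "http://example.com/index.html")
```

A spider would repeat this step for each monitored source and queue the links it has not seen before.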

13
News Filtering
  • After collecting news from FMD related news
    websites by using spidering programs, we need to
    filter out the news unrelated to FMD.
  • We use a keyword-based filtering method. All
    news items that do not contain FMD keywords are
    filtered out as unrelated news.
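Keyword-based filtering of this kind can be sketched in a few lines. The keyword list below is an assumption made for illustration, since the slides do not list the actual FMD keywords, and the function names are hypothetical.

```python
import re

# Assumed keyword list; the slides do not give the actual FMD keywords.
FMD_KEYWORDS = ["foot and mouth disease", "foot-and-mouth disease", "fmd"]

# One case-insensitive pattern that matches any keyword as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in FMD_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_fmd_related(text):
    """True if the text contains at least one FMD keyword."""
    return _PATTERN.search(text) is not None


def filter_news(items):
    """Keep only the news items that mention an FMD keyword."""
    return [text for text in items if is_fmd_related(text)]


news = [
    "FMD suspected on a cattle farm",
    "Local team wins the cup final",
    "Foot-and-mouth disease vaccination drive begins",
]
kept = filter_news(news)
```

Matching on word boundaries avoids keeping items where "fmd" only appears inside a longer token.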

14
Document Representation and Feature Selection
  • In this study, we use three document
    representation methods: Bag of Words, Noun
    Phrases and Named Entities.
  • First, we build two baseline feature subsets
    (BFS). One is composed of Bag of Words features
    (FeatureBFS-BW), and the other is composed of
    the combination of Bag of Words, Noun Phrases
    and Named Entities features (FeatureBFS-Comb).
    Features which appear more than 5 times in the
    whole collection are taken into account.
  • Then we build another two feature subsets,
    FeatureSFS-BW and FeatureSFS-Comb, by conducting
    feature selection on FeatureBFS-BW and
    FeatureBFS-Comb using Correlation-based Feature
    Selection and Best First Search.
  • We plan to compare the performance of these four
    feature subsets.
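As a rough illustration of the Bag of Words baseline with the "more than 5 times" threshold, the sketch below builds a vocabulary and document vectors with the standard library only. The tokenizer and function names are assumptions. Stemming, stop word removal, Noun Phrases and Named Entities would need additional tooling and are not shown.

```python
import re
from collections import Counter

MIN_COUNT = 5  # features must appear more than 5 times in the collection


def tokenize(text):
    """Very simple tokenizer: lowercase alphabetic words only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents, min_count=MIN_COUNT):
    """Keep words whose total count over all documents exceeds min_count."""
    totals = Counter()
    for doc in documents:
        totals.update(tokenize(doc))
    return sorted(word for word, count in totals.items() if count > min_count)


def to_vector(doc, vocabulary):
    """Represent one document as term counts over the vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


# Toy collection, made up for illustration.
docs = [
    "outbreak outbreak outbreak confirmed",
    "outbreak outbreak outbreak vaccine",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

In this toy collection only "outbreak" passes the threshold, so each document becomes a one-element vector.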

15
CFS (Correlation-based Feature Selection)
  • The advantage of CFS (Hall, 1998; Hall, 2000) is
    that it evaluates groups of attributes together
    rather than individual attributes (Hall and
    Holmes, 2003).
  • CFS uses a subset evaluation heuristic which
    assigns high scores to subsets containing
    attributes that are highly correlated with the
    class and have low intercorrelation with each
    other.
  • Hall and Holmes (2003) compared six feature
    selection methods and found that CFS performed
    well consistently on both Naïve Bayes and C4.5
    Decision Trees. They suggested that CFS is a good
    overall performer. They also found that CFS chose
    fewer features and thus performed faster than the
    others.
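The subset heuristic described above is usually written as merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below computes that score for a given subset. The correlation values are made up, the function name is hypothetical, and the search over subsets (Best First Search in this study) is not shown.

```python
from itertools import combinations
from math import sqrt


def cfs_merit(subset, class_corr, feature_corr):
    """CFS merit of a feature subset.

    class_corr maps feature -> correlation with the class.
    feature_corr maps an alphabetically sorted (f1, f2) pair -> correlation.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(class_corr[f]) for f in subset) / k
    pairs = list(combinations(sorted(subset), 2))
    r_ff = sum(abs(feature_corr[p]) for p in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Made-up correlations for three hypothetical features.
class_corr = {"outbreak": 0.8, "confirmed": 0.7, "vaccine": 0.6}
feature_corr = {
    ("confirmed", "outbreak"): 0.9,  # highly redundant pair
    ("confirmed", "vaccine"): 0.1,
    ("outbreak", "vaccine"): 0.1,
}

redundant = cfs_merit({"outbreak", "confirmed"}, class_corr, feature_corr)
complementary = cfs_merit({"outbreak", "vaccine"}, class_corr, feature_corr)
```

With these numbers the less redundant pair scores higher, even though its individual class correlations are lower on average.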

16
Three Categories of FMD News
  • The differences among some of the seven
    categories of FMD news are subtle.
  • Through discussion, the FMD Lab has confirmed
    that it is reasonable to combine these seven
    categories into the following 3 categories.
  • Category 1: FMD outbreak related
    • FMD outbreak in some location
    • FMD suspected in some location
    • FMD outbreak confirmed
    • FMD Follow-up report
  • Category 2: FMD control program related
    • FMD control program in some location
  • Category 3: FMD social and economic consequences
    and general information
    • FMD social and economic consequences
    • FMD General Information
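The seven-to-three mapping listed above can be written down as a simple lookup table. The dictionary and function names are illustrative, and the label strings are copied from this slide rather than from the FMD Lab's data files.

```python
# Original seven categories mapped to the three combined categories.
SEVEN_TO_THREE = {
    "FMD outbreak in some location": 1,
    "FMD suspected in some location": 1,
    "FMD outbreak confirmed": 1,
    "FMD Follow-up report": 1,
    "FMD control program in some location": 2,
    "FMD social and economic consequences": 3,
    "FMD General Information": 3,
}


def map_category(original_label):
    """Return the combined category (1, 2 or 3) for an original label."""
    return SEVEN_TO_THREE[original_label]


combined = [map_category(label) for label in SEVEN_TO_THREE]
```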

17
Classification and Evaluation
  • Classification
  • In this study, we choose four widely used
    machine learning algorithms: K-Nearest
    Neighbour, Learned Bayes Net, Naïve Bayesian and
    Support Vector Machine.
  • Support Vector Machine was applied to text
    categorization by Joachims (1998). Many studies
    have shown that it achieves top-notch
    performance in text classification (Fabrizio,
    2002).
  • However, Support Vector Machine does not always
    perform best. In this study, we plan to compare
    SVM with the other three baseline algorithms.
  • Evaluation
  • For each news article, a classifier's prediction
    is compared with the real category assigned by
    domain experts for evaluation.

18
Evaluation Metrics
  • Standard classification performance metrics used
    in information retrieval and machine learning
    studies.
  • Overall correctness
  • Correctness for class i

Class 1: FMD outbreak related
Class 2: FMD control program related
Class 3: FMD social and economic consequences and
general information
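The formulas on this slide did not survive in the transcript. As a sketch under the usual definitions, overall correctness can be read as accuracy and per-class correctness as precision, recall and F-measure for class i. This is an assumption about what the slide showed, and the labels below are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted class equals the expert label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def class_metrics(y_true, y_pred, label):
    """Precision, recall and F-measure for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up expert labels and predictions over the three classes.
y_true = [1, 1, 2, 3, 3, 3]
y_pred = [1, 2, 2, 3, 3, 1]

overall = accuracy(y_true, y_pred)
p1, r1, f1 = class_metrics(y_true, y_pred, 1)
```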
19
Testbed for Classification
  • There are 2832 news items about FMD collected by
    the FMD Lab (UC Davis). The news items date from
    10/06/2004 to 01/20/2007.
  • 1158 of these FMD news items have not been
    classified.
  • 1674 FMD news items have been classified into the
    seven categories by domain experts.
  • We map the seven categories of the 1674 pieces of
    news into three categories. Then we use these
    1674 pieces of news to train the classifier.
  • We use 10-fold cross validation as the evaluation
    approach.
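A minimal sketch of how 10-fold cross validation partitions the 1674 labeled items is shown below. The shuffling seed and function name are assumptions, and the study's actual tooling may split the data differently (for example, with stratification by class).

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists; every item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(1674, k=10))
```

Each classifier is trained on the nine training folds and evaluated on the held-out fold, and the ten scores are then averaged.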

20
Hypotheses
  • H1: The expanded representation of Bag of
    Words, Noun Phrases and Named Entities features
    outperforms the Bag of Words features in FMD news
    classification.
  • H1a: FeatureBFS-Comb > FeatureBFS-BW
  • H1b: FeatureSFS-Comb > FeatureSFS-BW
  • H2: The selected feature subsets outperform the
    baseline feature subsets in FMD news
    classification.
  • H2a: FeatureSFS-BW > FeatureBFS-BW
  • H2b: FeatureSFS-Comb > FeatureBFS-Comb

21
Hypotheses (cont'd)
  • H3: Support Vector Machine (SVM) outperforms
    K-Nearest Neighbour (KNN), Learned Bayesian Net
    (LBN) and Naïve Bayesian (NB) for selected
    feature subsets in FMD news classification.
  • H3a: SVM > KNN on FeatureSFS-BW
  • H3b: SVM > KNN on FeatureSFS-Comb
  • H3c: SVM > LBN on FeatureSFS-BW
  • H3d: SVM > LBN on FeatureSFS-Comb
  • H3e: SVM > NB on FeatureSFS-BW
  • H3f: SVM > NB on FeatureSFS-Comb

22
Results - Overall Correctness (Accuracy)
23
Hypothesis Testing H1 Representations
Results of pairwise t-tests. Significance is marked
at alpha = 0.10, alpha = 0.05 and alpha = 0.01. The
underlined values mean that the results do not
confirm the hypotheses.
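The result tables are not part of this transcript. As a sketch of the test itself, a pairwise t statistic can be computed from the per-fold scores of two configurations. The per-fold accuracies below are invented for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error.

    Assumes the differences are not all identical (non-zero standard deviation).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Invented per-fold accuracies for two feature sets (10 folds each).
comb = [0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.82, 0.81, 0.79, 0.83]
bow = [0.78, 0.80, 0.79, 0.78, 0.80, 0.79, 0.80, 0.80, 0.78, 0.80]

t_value = paired_t_statistic(comb, bow)
```

The statistic is compared against the t distribution with n - 1 degrees of freedom (9 for ten folds) at the chosen alpha level.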
24
Hypothesis Testing H2 Feature Selection
Results of pairwise t-tests. Significance is marked
at alpha = 0.10, alpha = 0.05 and alpha = 0.01. The
underlined values mean that the results do not
confirm the hypotheses.
25
Hypothesis Testing H3 Classifiers
Results of pairwise t-tests. Significance is marked
at alpha = 0.10, alpha = 0.05 and alpha = 0.01. The
underlined values mean that the results do not
confirm the hypotheses.
26
Average Precision, Recall and F-measure
  • SVM on FeatureSFS-Comb achieves the highest
    average precision, recall and F-measure.

27
Conclusions
  • In this study, we have described a general
    framework for building a domain-specific news
    monitoring and classification system. Our
    experimental study is based on FMD news. However,
    this framework can also be adopted in other
    domains.
  • We have compared the classification performance
    of different machine learning algorithms using
    different feature sets. We find that SVM achieves
    the highest performance on FMD news when using
    the combined features after feature selection.
  • We believe the automatic FMD news monitoring and
    classification system will be helpful for FMD
    surveillance around the world.