An Artificial Immune Based Approach to Semantic Document Classification

1
An Artificial Immune Based Approach to Semantic
Document Classification
  • Julie Greensmith, Steve Cayzer
  • BICAS research group
  • Hewlett-Packard Laboratories, Bristol
  • (University of Leeds)
  • julie.greensmith_at_hp.com
  • September 2003

2
The Problem: Information Overload
  • Wealth of Internet resources
  • Google fact: as of August 2003, Google searches approximately
    3 BILLION PAGES
  • Drowning in data
  • Need to add structure to the information

3
Current Solutions
  • Use metadata to describe meaningful relationships
  • Yahoo pages organised into a taxonomy

[Figure: example taxonomy with node labels MUSIC, POPULAR, CLASSICAL, DANCE, ROCK, CLASSICAL, BAROQUE]

4
HOWEVER
  • Requires manual annotation, i.e. effort
  • Markup of both new and existing pages
  • Placing documents into taxonomies is subjective
  • Need a personalised system to add structure

5
So, what should we do about it?
  • Automatic document classification
  • Less labour intensive
  • Multiple taxonomy mapping
  • A number of potential approaches, e.g. decision trees and
    naïve Bayesian classifiers
  • Use a more novel computational approach, namely
    an artificial immune system, as a document
    classifier

6
AIRS and Document Classification
  • AIRS (Artificial Immune Recognition System)
  • Developed by Andrew Watkins
  • B-cell inspired, resource-limited algorithm
  • Classifies using k-nearest neighbour
  • Supervised learning paradigm
  • Multi-class classification
  • Good performance on other data mining tasks
  • What we would like to achieve
  • Push AIRS into the novel domain of document
    classification, and more significantly, semantic
    document classification
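
To make the "k-nearest neighbour" bullet concrete, here is a minimal sketch of a k-NN vote over a set of labelled vectors. It covers only the voting step, not the training that evolves memory cells. The names `hamming` and `knn_classify`, the use of Hamming distance, and the toy data are assumptions for illustration and are not taken from the slides.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length Boolean vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(memory_cells, query, k=3):
    """Return the majority label among the k cells nearest to `query`.

    memory_cells: list of (vector, label) pairs, assumed to be already trained.
    """
    nearest = sorted(memory_cells, key=lambda cell: hamming(cell[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical memory cells over four Boolean features.
cells = [
    ([1, 1, 0, 0], "rock"),
    ([1, 0, 0, 0], "rock"),
    ([0, 0, 1, 1], "classical"),
    ([0, 0, 0, 1], "classical"),
]
label = knn_classify(cells, [1, 1, 0, 1], k=3)
```

Hamming distance is chosen here only because it fits Boolean vectors; any other distance could be substituted in `knn_classify`.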

7
So, how are we going to achieve this?
  • Creation of feature vectors
  • Extraction: simple text processing
  • Selection: information gain
  • Representation: Boolean feature representation
  • Present feature vectors to classifier system
  • Measure predictive accuracy
  • Hierarchical placing of documents
  • Multiple Taxonomy mapping
  • Placement in taxonomy to derive semantic features
  • Semantic similarity measures
  • Personalised Semantic Structure
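
The feature-vector steps above (extraction, selection by information gain, Boolean representation) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: the tokeniser, the function names, and the four toy documents are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Simple text processing: lowercase, keep alphabetic tokens, drop duplicates."""
    return set(re.findall(r"[a-z]+", text.lower()))

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(term, docs, labels):
    """Reduction in class entropy from knowing whether `term` occurs in a document."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional

def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(t, docs, labels),
                    reverse=True)
    return ranked[:k]

def to_boolean_vector(doc, features):
    """1 if the selected term occurs in the document, else 0."""
    return [1 if f in doc else 0 for f in features]

# Hypothetical toy corpus with two classes.
texts = ["guitar rock band", "rock concert band",
         "violin baroque concert", "baroque orchestra violin"]
labels = ["rock", "rock", "classical", "classical"]
docs = [tokenize(t) for t in texts]
features = select_features(docs, labels, k=4)
vectors = [to_boolean_vector(d, features) for d in docs]
```

In this toy corpus a term such as "concert", which appears once in each class, has zero information gain and is not selected, while a term such as "rock", which separates the classes perfectly, has a gain of one bit.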

8
Work performed so far
  • Document classification
  • Yahoo taxonomies: business pages
  • 2 classes
  • Small, medium & large datasets
  • Multi-class datasets
  • Varied number of features
  • Comparison with a naïve Bayesian based system
  • Preliminary Results
  • Promising
  • Presented in my MSc Thesis
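
For readers unfamiliar with the baseline, the following is a minimal sketch of a naïve Bayesian classifier over Boolean feature vectors, with a helper that measures predictive accuracy. The slides do not say which naïve Bayes variant was used, so the Bernoulli model, the Laplace smoothing, the function names, and the toy data are all assumptions.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(vectors, labels):
    """Estimate class priors and Laplace-smoothed per-feature presence probabilities."""
    by_class = defaultdict(list)
    for v, y in zip(vectors, labels):
        by_class[y].append(v)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        prior = n / len(vectors)
        # zip(*rows) iterates over feature columns; +1 / +2 is Laplace smoothing.
        probs = [(sum(col) + 1) / (n + 2) for col in zip(*rows)]
        model[y] = (prior, probs)
    return model

def predict(model, vector):
    """Return the class with the highest log posterior for a Boolean vector."""
    best_label, best_score = None, -math.inf
    for y, (prior, probs) in model.items():
        score = math.log(prior)
        for x, p in zip(vector, probs):
            score += math.log(p if x else 1.0 - p)
        if score > best_score:
            best_label, best_score = y, score
    return best_label

def accuracy(model, vectors, labels):
    """Fraction of vectors whose predicted label matches the true label."""
    correct = sum(predict(model, v) == y for v, y in zip(vectors, labels))
    return correct / len(labels)

# Hypothetical Boolean training data with two classes.
train_x = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
train_y = ["rock", "rock", "classical", "classical"]
nb_model = train_bernoulli_nb(train_x, train_y)
```

In a real comparison, accuracy would be measured on held-out documents rather than on the training data shown here.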

9
Some Preliminary Figures
  • No significant difference found between the 2-class and
    4-class datasets
  • AIRS performed significantly better than
    alternative naïve Bayesian system

10
Semantic Antics
  • The AIRS system can classify documents
  • New frontiers for an artificial immune system
  • Perform multiple taxonomy mapping for
    personalisation
  • Use semantic features in conjunction with text
    features to develop hybrid feature vectors
  • Semantic similarity measures
  • This is work to be performed in the near future
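
As one possible illustration of a semantic similarity measure over a taxonomy, the sketch below scores two categories by the length of the path between them through their lowest common ancestor. The slides do not specify a measure, so the path-based formula, the function names, and the parent links (a guess at the MUSIC example's layout) are all assumptions.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, following child -> parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def path_similarity(a, b, parent):
    """1 / (1 + number of edges between a and b via their lowest common ancestor)."""
    depth_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_b, n in enumerate(path_to_root(b, parent)):
        if n in depth_a:
            return 1.0 / (1 + depth_a[n] + steps_b)
    return 0.0  # no common ancestor: the nodes lie in different trees

# Hypothetical taxonomy: child -> parent (assumed layout, not from the slides).
parent = {
    "POPULAR": "MUSIC",
    "CLASSICAL": "MUSIC",
    "DANCE": "POPULAR",
    "ROCK": "POPULAR",
    "BAROQUE": "CLASSICAL",
}
sibling_score = path_similarity("ROCK", "DANCE", parent)
distant_score = path_similarity("ROCK", "BAROQUE", parent)
```

Under this measure, sibling categories score higher than categories in different branches. Scores like these could serve as the semantic component of a hybrid feature vector.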

11
Summary
  • Suffering from data overload
  • Alleviate the problem with hierarchical document
    classification
  • Use AIRS as a document classification tool
  • Achieved promising results on a number of
    datasets
  • Use AIRS as a semantic document classification
    tool
  • Semantic similarity metrics
  • Personalised semantic structure
