Title: GammaWare Technology June 2002
1 GammaWare Technology, June 2002
- Yiftach Ravid, VP R&D
- GammaSite Inc.
- yiftach_at_GammaSite.com
2 Overview
- The challenge
- Taxonomies
- Classification
- Categorization
- Focused Crawler
- Q&A
3 The challenge: Generate Structured Taxonomies of text repositories
[Diagram labels: Business, Relevant Content; Unstructured Data; Internal DB; Information; Word; Structured Data; Application; Web; Forms; XML; Services; Catalogues; Mail; Domino]
- Generate a structured taxonomy of huge text repositories
4 Taxonomy
5 What is a Taxonomy
- Taxonomy
- Taxis: arrangement or division
- Nomos: law
- The science of classification according to a pre-determined system
- Best-known use of taxonomy is in Biology
- taxonomies of animals and plants
6 Web Taxonomy
- Best-known use of taxonomies
- Web portals or Directories
- Internet sites classified into hierarchical topics
- General
- Yahoo! http://www.yahoo.com/
- Open Directory http://www.dmoz.org/
- LookSmart http://www.looksmart.com/r?country=uk
- Topical
- Business.Com http://www.business.com/
- HealthWeb http://www.healthweb.org/
- Education Planet http://www.educationplanet.com/
7 Taxonomy - Sample
8 Taxonomy vs. Thesaurus
9 Classification
10 What is a Classifier
- Concept (Topic, Subject)
- An abstract or generic idea generalized from particular instances (Merriam-Webster)
- Classifier
- A function on a concept (category) and on an object (document)
- Returns a number between 0 and 1, called the confidence rate
- Confidence rate: measures the confidence that the object (document) belongs to (should be classified under) the concept (category)
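The definition above can be illustrated with a minimal sketch: a function of a category and a document that returns a confidence rate between 0 and 1. The keyword lists and the keyword-overlap scoring are assumptions made for illustration only; they are not how GammaWare classifiers work.

```python
from typing import Callable

# A classifier maps (category, document) to a confidence rate in [0, 1].
Classifier = Callable[[str, str], float]

# Hypothetical keyword lists per category, for illustration only.
CATEGORY_KEYWORDS = {
    "tea": {"tea", "herbal", "brew", "leaf"},
    "finance": {"stock", "bond", "market", "fund"},
}


def keyword_classifier(category: str, document: str) -> float:
    """Return the fraction of the category's keywords found in the document."""
    keywords = CATEGORY_KEYWORDS.get(category, set())
    if not keywords:
        return 0.0
    words = set(document.lower().split())
    return len(keywords & words) / len(keywords)


doc = "Herbal tea is a brew of dried leaf and flowers"
print(keyword_classifier("tea", doc))      # prints 1.0
print(keyword_classifier("finance", doc))  # prints 0.0
```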
11 Methods for Automatic Classification
- Rule based
- Pre-defined set of rules
- Advantage
- incorporating prior knowledge
- Disadvantages
- extreme reliance on man-made rules
- costly in terms of man-hours
- Linguistics
- Use of morphology, syntax and semantics
- Not multilingual; demands many training examples
- Machine Learning
12 What is Machine Learning
- Machine Learning is the study of computer
algorithms that automatically improve performance
through experience
13 Sample for Machine Learning
14 Discriminating Features
Q1: Who is this person?
Q2: What are the most discriminating features?
15 Discriminating Features
16 Discriminating Features
The Margaret Thatcher effect
17 Supervised Inductive Learning
- A process where
- A learning algorithm is provided with a set of labeled instances, positive and negative examples (a training set)
- Using the training set, the learning algorithm generates a classifier
- The quality of the classifier is measured via its ability to perform well on novel instances (a test set)
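The process above can be sketched end to end on toy data. This example uses naive Bayes with add-one smoothing as a stand-in learning algorithm; the deck does not say which algorithm GammaWare uses, and the documents below are invented for illustration. The sketch assumes the training set contains at least one positive and one negative example.

```python
import math
from collections import Counter


def train(training_set):
    """Build per-label word counts from (text, is_positive) pairs."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, label in training_set:
        counts[label].update(text.lower().split())
        docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab


def confidence(model, text):
    """Return P(positive | text) under naive Bayes with add-one smoothing."""
    counts, docs, vocab = model
    log_score = {}
    for label in (True, False):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        log_score[label] = score
    # Turn the two log scores into a probability for the positive label.
    return 1.0 / (1.0 + math.exp(log_score[False] - log_score[True]))


# Hypothetical labeled instances: True = belongs to the "tea" category.
training_set = [
    ("green tea leaves brewing", True),
    ("herbal tea infusion recipe", True),
    ("stock market closing prices", False),
    ("bond fund market report", False),
]
test_set = [
    ("herbal tea brewing guide", True),
    ("market prices for bond fund", False),
]

model = train(training_set)
correct = sum((confidence(model, text) >= 0.5) == label for text, label in test_set)
print(f"test accuracy: {correct / len(test_set):.2f}")
```

The classifier is built only from the training set and judged only on the test set, which mirrors the separation the slide describes.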
18 Supervised Inductive Learning Example
[Figure panels: Training, Test]
19 Evaluating a Classifier
[Figure labels: Classifier, Category]
20 Recall and Precision
Use a confusion matrix to count
Precision (P) = GY / (GY + GN) = 70 / (70 + 50) ≈ 0.58
Recall (R) = GY / (GY + BY) = 70 / (70 + 30) = 0.70
Accuracy (A) = (GY + BN) / (GY + GN + BY + BN) = 220 / 300 ≈ 0.73
F-measure (F) = 2 / (1/P + 1/R) = 2GY / ((GY + GN) + (GY + BY)) = 2 * 70 / (120 + 100) ≈ 0.64
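The four measures can be checked with a short script. The reading of the cell names is inferred from the formulas on this slide (GY as correct acceptances, GN as wrong acceptances, BY as wrong rejections, BN as correct rejections), and BN = 150 is derived from the stated total of 300; both are assumptions rather than definitions given in the deck.

```python
def evaluate(gy: int, gn: int, by: int, bn: int) -> dict:
    """Compute the slide's measures from the four confusion-matrix counts."""
    precision = gy / (gy + gn)
    recall = gy / (gy + by)
    accuracy = (gy + bn) / (gy + gn + by + bn)
    # Harmonic mean of precision and recall, written in terms of the counts.
    f_measure = 2 * gy / ((gy + gn) + (gy + by))
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


scores = evaluate(gy=70, gn=50, by=30, bn=150)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these counts the values come out at roughly 0.583, 0.700, 0.733 and 0.636.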
21 Supervised Statistical Machine Learning
- A Supervised Inductive Learning method that is based on statistics obtained from the training set
- Benefits
- Generality and flexibility
- Successfully applied across a broad spectrum of problems
- Multilingual
- Low labor costs
22 How to Classify documents
- Pre-defined fields (Structured data)
- Author
- Title
- Date
- Content (Unstructured data)
- From title, main text, emphasized text
- All words
- All 2-word sequences, all 3-word sequences, etc.
- Phrases, Synonyms, etc.
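As a rough illustration of the content features listed above, the sketch below counts all single words plus all 2-word and 3-word sequences in a text. The function names and the whitespace tokenization are assumptions for illustration; the deck does not describe how GammaWare tokenizes or weights features.

```python
from collections import Counter
from typing import List


def word_ngrams(text: str, n: int) -> List[str]:
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def extract_features(text: str, max_n: int = 3) -> Counter:
    """Count all 1-word, 2-word, ... max_n-word sequences."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(word_ngrams(text, n))
    return features


features = extract_features("Herbal tea specialist")
for feature, count in features.items():
    print(count, feature)
```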
23 Getting Started
24 GammaWare Work Flow
[Workflow diagram labels: Improve Classifiers, Check Seed]
25 Requirements
- Initial parameters and decisions
- Level of percolation, which affects
- Recall
- Precision
- Multi label
- Maximum number of categories into which a document can be classified
- Types of training documents
- Full text, Keywords
- Different types per category
- List of Stop Words
- Common words in the language used and also in the topic
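Two of these parameters can be sketched in a few lines: a stop-word filter and a multi-label cap. The confidence threshold used here is a hypothetical stand-in for a setting that trades recall against precision; the deck does not define "level of percolation", so this should not be read as its implementation. The stop-word list and scores are also invented.

```python
# Hypothetical stop-word list; a real one would be language- and topic-specific.
STOP_WORDS = {"the", "a", "of", "and", "in"}


def remove_stop_words(text: str, stop_words=STOP_WORDS) -> list:
    """Drop common words that carry little topical information."""
    return [word for word in text.lower().split() if word not in stop_words]


def select_categories(confidences: dict, threshold: float, max_labels: int) -> list:
    """Keep at most max_labels categories whose confidence reaches the threshold."""
    passing = [(cat, score) for cat, score in confidences.items() if score >= threshold]
    passing.sort(key=lambda item: item[1], reverse=True)
    return [cat for cat, _ in passing[:max_labels]]


# Hypothetical confidence rates for one document.
scores = {"tea": 0.9, "food": 0.7, "health": 0.6, "finance": 0.1}
print(remove_stop_words("The price of the tea"))
print(select_categories(scores, threshold=0.5, max_labels=2))
print(select_categories(scores, threshold=0.8, max_labels=2))
```

Raising the threshold keeps fewer, more confident labels (favoring precision); lowering it keeps more (favoring recall).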
26 Taxonomy
- A Taxonomy is constructed according to
- User/Business needs
- who will be using the taxonomy
- Data
- content of documents for classification
- Good taxonomy
- requires critical attention to both the definition and application of categories and their labels
- simple and intuitive
- How: Using the Expert Tool
27 Seeding process
- Seeding process: each category within the taxonomy needs to be given a few examples of relevant documents of the same type that the user seeks to catalog
- An average of 3-6 relevant documents per category
- Seeds can be either positive or negative seeds for each category
- For better results, training documents should have a structure similar to that of the documents for classification
- How: Using the Expert Tool
28 Check Seed
- Check seed: Classify the seeds into the taxonomy
- Output: An HTML page (browsed by the Expert tool)
- For each category, shows the cataloging results for all the relevant seeds.
- Why: Help in locating seeding problems
- Seeds that are multi labeled
- Problems in taxonomy structure
- How: Using the GammaWare Manager
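The idea of a seed check can be sketched generically: classify every seed against every category and report seeds that miss their own category or land in several. This is not the GammaWare Manager's implementation; the toy keyword classifier, the 0.5 threshold, and the seed documents are all assumptions for illustration, and only positive seeds are handled.

```python
# Hypothetical keyword lists standing in for trained classifiers.
KEYWORDS = {"tea": {"tea", "herbal"}, "coffee": {"coffee", "espresso"}}


def toy_classify(category: str, document: str) -> float:
    """Fraction of the category's keywords that occur in the document."""
    words = set(document.lower().split())
    return len(KEYWORDS[category] & words) / len(KEYWORDS[category])


def check_seeds(seeds: dict, classify, threshold: float = 0.5) -> list:
    """Return (seed, home category, problem) for every suspicious seed."""
    problems = []
    categories = list(seeds)
    for home, documents in seeds.items():
        for doc in documents:
            hits = [c for c in categories if classify(c, doc) >= threshold]
            if home not in hits:
                problems.append((doc, home, "not classified into its own category"))
            elif len(hits) > 1:
                problems.append((doc, home, "multi-labeled: " + ", ".join(hits)))
    return problems


seeds = {
    "tea": ["herbal tea guide", "tea and coffee with espresso"],
    "coffee": ["espresso coffee basics"],
}
for doc, home, problem in check_seeds(seeds, toy_classify):
    print(f"[{home}] {doc!r}: {problem}")
```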
29 Train Classifiers
- Train: Train classifiers for all categories
- Output: A classifier file (gcl extension) for each category
- Why: The classifiers are used for categorization.
- How: Using the GammaWare Manager
30 Classify Documents
- Categorization: Catalogue documents into a Taxonomy
- Output: A table in a database
- Why: This is why we are here.
- How: Using the GammaWare Manager
31 Improve Classifiers
- Methods to improve classification results using the Expert Tool.
- Re-design the taxonomy
- Seed problems
- More examples
- Add new seeds
- drag and drop documents from classification view
- Negative seeds
- Modify Categorization and Train parameters
32 Categorization
33 Hierarchical Categorization
- Goal: Classify a document into the appropriate sub-topic(s) in the taxonomy
- Difficulties
- Many sub-topics
- A document may fall into several sub-topics
- Classifiers are not perfect
- Must control Recall and Precision according to the client's needs
34 Hierarchical Categorization
- Divide and Conquer solution
- Solve the problem level by level
- At each level, decompose the problem into several smaller classification sub-problems
- Note: ignoring interactions between sub-problems can yield poor results
- Patent Pending on Categorization
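The level-by-level idea can be sketched as a generic top-down descent through a taxonomy tree. This is a textbook-style illustration, not the patent-pending GammaWare method: each node is scored independently, so it ignores exactly the interactions between sub-problems that the slide warns about. The taxonomy, keyword scoring, and threshold are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Category:
    name: str
    keywords: set
    children: List["Category"] = field(default_factory=list)


def confidence(category: Category, document: str) -> float:
    """Toy confidence: fraction of the category's keywords in the document."""
    if not category.keywords:
        return 0.0
    words = set(document.lower().split())
    return len(category.keywords & words) / len(category.keywords)


def categorize(node: Category, document: str, threshold: float = 0.5,
               path: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Descend one level at a time; return paths to the deepest matches."""
    results = []
    for child in node.children:
        if confidence(child, document) >= threshold:
            child_path = path + (child.name,)
            deeper = categorize(child, document, threshold, child_path)
            # Keep the sub-topic paths if any matched, else stop at this level.
            results.extend(deeper if deeper else [child_path])
    return results


# Hypothetical two-level taxonomy.
TAXONOMY = Category("root", set(), [
    Category("health", {"health", "remedy"}, [
        Category("herbal", {"herbal", "tea"}),
        Category("fitness", {"exercise", "gym"}),
    ]),
    Category("business", {"market", "stock"}),
])

print(categorize(TAXONOMY, "herbal tea is a traditional health remedy"))
print(categorize(TAXONOMY, "stock market news"))
```

Because only the children of matching nodes are examined, each level is a small classification problem rather than one decision over every sub-topic at once.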
35 Focused Crawler
36 Topic Specific Crawling
- Retrieve all documents that are relevant to a specific topic of interest
- Hyper-linked networks (Intranet, Internet)
- Two options
- Crawl the network, then apply classification schemes to filter relevant documents.
- Using classification schemes, crawl the network while teaching the crawler to imitate (intelligent) human surfing strategies
37 Simple Crawling
- The Network is huge
- Storage
- Network
- Time
- Good for general-purpose search engines
- Crawling: The process of retrieving documents from the net
38 Focused Crawling via Link Classifiers
- Analyze the context of the link
[Figure labels: "Herbal tea specialist"; "My brother new born child"; "Link is irrelevant"]
- Link classifier: Decision according to the context of the link
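A focused crawl driven by a link classifier can be sketched over a toy in-memory link graph. Everything here is an assumption for illustration: the graph, the topic words, the word-overlap link score, and the 0.3 threshold. The anchor texts echo the slide's figure labels. A real crawler would fetch pages over the network and use a trained link classifier.

```python
import heapq

# Toy link graph: page -> list of (link context, target page).
WEB = {
    "home": [("herbal tea specialist", "tea-shop"),
             ("my brother's newborn child", "family")],
    "tea-shop": [("green tea brewing guide", "brewing"),
                 ("contact us", "contact")],
    "family": [("holiday photos", "photos")],
    "brewing": [],
    "contact": [],
    "photos": [],
}
TOPIC_WORDS = {"tea", "herbal", "brewing"}


def link_score(context: str) -> float:
    """Toy link classifier: share of context words that are topic words."""
    words = set(context.lower().split())
    return len(words & TOPIC_WORDS) / len(words) if words else 0.0


def focused_crawl(start: str, threshold: float = 0.3) -> list:
    """Visit pages in order of link score, skipping links judged irrelevant."""
    visited = []
    seen = {start}
    frontier = [(-1.0, start)]  # heapq is a min-heap, so scores are negated
    while frontier:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for context, target in WEB.get(page, []):
            score = link_score(context)
            if target not in seen and score >= threshold:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited


print(focused_crawl("home"))  # prints ['home', 'tea-shop', 'brewing']
```

The "family", "photos" and "contact" pages are never retrieved, which is the saving in storage, network and time that focused crawling aims for compared with crawling everything and filtering afterwards.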
39 Focused Crawler: The Learning Process
[Figure label: "Herbal tea specialist"]
- Crawler Classifier: Checks if the document is good for Crawling
40GammaWare API
41Architecture - Basic
42Multiple Servers
- Scalability and Availability
43Q A