Title: Project 2
- Presented by
- VSM model
- The training set contains 18 documents (10
positive, 8 negative).
- Ontovector = (1, 0.91, 1, 0.85, 0.90, 1.16,
0.33) corresponds to (Type, GoldAlloy, Price,
Diamondweight, MetalKind, Jem, JemShap)
- Threshold = 0.7
- Testing set contains 20 documents (10 positive,
10 negative)
- Recall = 1, Precision = 0.8
- Instructor's testing set contains 24 documents
(1 positive, 23 negative)
- Recall = 1, Precision = 1
- NB Model
- The training set contains 18 documents (10
positive, 8 negative).
- Testing set contains 20 documents (10 positive,
10 negative)
- Recall = 1, Precision = 0.7
- Instructor's testing set contains 24 documents
(1 positive, 23 negative)
- Recall = 1, Precision = 0.5
- VSM Model
- The average of each attribute = the number of
occurrences of the attribute / the number of
records.
- NB Model
- For the vocabulary, remove all stop-words from the documents.
- In the results I always have Recall = 1, which means
the process does not discard any relevant documents.
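The per-record averaging and the threshold test described above can be sketched as follows. This is a minimal sketch, not the presenter's implementation: it assumes the similarity measure is cosine similarity (the slides do not say which measure was used), and the attribute counts in the usage note are hypothetical. Only the Ontovector and the 0.7 threshold come from the slide.

```python
import math

# Ontovector and threshold as given on the slide.
ONTO = [1, 0.91, 1, 0.85, 0.90, 1.16, 0.33]
THRESHOLD = 0.7


def attribute_averages(counts, num_records):
    """Average of each attribute = occurrences of the attribute / number of records."""
    if num_records <= 0:
        raise ValueError("num_records must be positive")
    return [c / num_records for c in counts]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Assumption: a zero vector gets similarity 0.0, since the cosine
    is undefined when either vector has zero length.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(doc_vector, onto_vector=ONTO, threshold=THRESHOLD):
    """A document is accepted when its similarity reaches the threshold."""
    return cosine_similarity(doc_vector, onto_vector) >= threshold
```

For example, a hypothetical document with 10 records and attribute counts `[10, 9, 10, 8, 9, 12, 3]` yields averages close to the Ontovector and is accepted, while a document that only matches the last attribute falls well below 0.7.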
Documents classification
- Two methods have been implemented in Java:
- - VSM: Vector Space Model.
- - NB: Naive Bayes.
- Applying VSM to my domain (Books) was not without
problems, mainly because of the
meaning of the title and the author. For example,
when applying VSM to cars, some things
needed to be figured out, such as: do we
consider the model of the car as the title and the
make as the author? Such an assumption caused
trouble, since some irrelevant documents
became relevant.
So the philosophy was to ignore the title and
author and use the other attributes to judge whether
the document is relevant or not. You can see from
the table that the car documents almost attain the
threshold. Note: when we take the title and the author
into consideration, the threshold becomes 0.999.
- Recall: 100%
- Precision: 100%
- Threshold: 76%
The threshold for the Books domain
- Similarity of documents from other domains
- Drug: 0.435
- Real_estate: 0.599
- Computer: 0.423
Naïve Bayes
- Both implemented methods are efficient.
- VSM is easier to implement and faster.
- Much time was spent because I misunderstood the NB
algorithm; this was my problem.
- When amplifying some key attributes that are
almost unique to a domain, 100% precision and
recall are very possible.
- NB is not very sensitive to the boundary values.
Tim Chartrand: Project 2 Results
- Application Domain
- Software (Shareware and Freeware)
- Size of training set
- Positive: 10
- Negative: 10
VSM Improvements
- Normalize positive training example results to
find the per-record expected values
- Add a weight to each attribute
- Epos(i) = expected value for attribute i in
positive examples
- Eneg(i) = expected value for attribute i in
negative examples
- Diff(i) = Epos(i) - Eneg(i)
- Weight(i) = Diff(i) / max_{j=1..n} Diff(j)
- Ontology(i) = Epos(i) * Weight(i)
- Weighting results
- Average difference improved from 0.587 to 0.714
- Separation improved from 0.280 to 0.422
- Price was given a weight of 0, so it is not considered in
document classification
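The weighting scheme above can be sketched in a few lines. This is only an illustration under stated assumptions: Ontology(i) is read as the product of Epos(i) and Weight(i), negative differences are clamped to 0 (the slides do not say how they were handled), and the expected values in the usage note are hypothetical.

```python
def weighted_ontology(e_pos, e_neg):
    """Return (weights, ontology) from per-attribute expected values.

    Diff(i)     = Epos(i) - Eneg(i)
    Weight(i)   = Diff(i) / max_j Diff(j)
    Ontology(i) = Epos(i) * Weight(i)

    Assumption (not in the slides): a negative Diff(i) is clamped to 0,
    so an attribute more common in negative examples is ignored.
    """
    if len(e_pos) != len(e_neg):
        raise ValueError("e_pos and e_neg must have the same length")
    diffs = [max(p - n, 0.0) for p, n in zip(e_pos, e_neg)]
    max_diff = max(diffs, default=0.0)
    if max_diff == 0.0:
        raise ValueError("no attribute separates positive from negative examples")
    weights = [d / max_diff for d in diffs]
    ontology = [p * w for p, w in zip(e_pos, weights)]
    return weights, ontology
```

With hypothetical expected values `e_pos = [1.0, 0.8, 0.5]` and `e_neg = [0.2, 0.4, 0.5]`, the third attribute occurs equally often in both sets and receives a weight of 0, which mirrors how Price dropped out of the classification.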
Bayes Results and Improvements
- Improvements: reduce the vocabulary to the best words
- Eliminate stopwords
- Stem common prefixes and suffixes
- Ignore case
- Eliminate numbers
- Remove non-alphabetic characters before and after
a word
Somewhat artificial result: I started out at
about 10% precision and added negative training
examples until I correctly classified all test
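The vocabulary-reduction steps listed on this slide can be sketched as a small token normalizer. The stopword list and suffix list below are hypothetical placeholders, and only suffix stripping is shown; a real stemmer would be more careful.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}
# Hypothetical suffixes; a real stemmer handles many more cases.
SUFFIXES = ("ing", "ed", "s")


def normalize_token(token):
    """Return the normalized token, or None if it should be dropped."""
    token = token.lower()                             # ignore case
    token = re.sub(r"^[^a-z]+|[^a-z]+$", "", token)   # trim non-letters at both ends
    if not token or token in STOPWORDS:               # pure numbers become empty here
        return None
    for suffix in SUFFIXES:                           # strip at most one suffix
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_vocabulary(text):
    """Sorted list of distinct normalized tokens in the text."""
    tokens = (normalize_token(t) for t in text.split())
    return sorted({t for t in tokens if t is not None})
```

Applied to the made-up string `"The Downloads, DOWNLOADING 1999 (free) of 3.5 games!"`, both spellings of "download" collapse to one entry and the numbers and stopwords disappear.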
VSM vs. Bayes
- End results were the same, but
- VSM performed better using only the original
dataset
- Bayes seems to need more training data (mainly
negative)
- Major advantage of VSM: clustering
- Using the ontology as a vector allowed effective
clustering of similar data items (e.g., dates,
prices, etc.)
- Reduced dimensionality from about 1500 to 8
Text Classification
- Helen Chen
- CS652 Project 2
- May 31, 2002
Documents and Methods
- Application: movie
- Documents
- Training Set
- Positive Docs: 5, Negative Docs: 24
- My Test Set
- Positive Docs: 5, Negative Docs: 14
- Methods: VSM model and NB
Vector Space Model and Naïve Bayes
- Results tested on my own testing set and the
instructor-provided testing set
(23 negative docs, 1 positive doc) for the VSM model
(left) and NB (right)
Comments on VSM
- Weighting is critical to performance
- Assign weight according to positive examples
- Adjust weight according to negative examples
Weights assigned on each attribute
Comments on NB
- The choice of irrelevant documents in training
set is critical to the performance
Results for Clustered training set
Results for evenly distributed training set
Yihong's Project 2
- Target topic: Apartment Rental
- Training Sets
- 5 positive, 10 negative
- Testing Sets
- Self sets: 5 positive, 9 negative
- TA sets: 1 positive, 23 negative
VSM Results
- 100% Precision and Recall for both self-collected
sets and TA-collected sets
- Threshold value: 0.868
- Most similar application
- Real Estate, range 0.792 to 0.846
- Compare with AptRental, range 0.891 to 0.937
- Weighting attributes
- Precision-weighted
- Recall-weighted
- F-measure-weighted
Naïve Bayes Results
- 100% Precision and Recall for both self-collected
sets and TA-collected sets
- Summation instead of multiplication
- to avoid the problem of underflow
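One common way to replace the product with a sum is to add log-probabilities, which is presumably what the slide refers to. The sketch below is a generic multinomial Naïve Bayes with add-one smoothing, not the presenter's code; the training tokens in the test data are hypothetical.

```python
import math
from collections import Counter


def train(docs_by_class):
    """docs_by_class maps a label to a list of token lists."""
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(t for d in docs for t in d)
        n = sum(counts.values())
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            # Add-one (Laplace) smoothing keeps every probability above zero.
            "log_likelihood": {
                t: math.log((counts[t] + 1) / (n + len(vocab))) for t in vocab
            },
        }
    return model


def classify(model, tokens):
    """Return (best_label, scores), where each score is a sum of logs."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for t in tokens:
            # Tokens outside the vocabulary are ignored for every class.
            if t in params["log_likelihood"]:
                score += params["log_likelihood"][t]
        scores[label] = score
    return max(scores, key=scores.get), scores
```

For a long document the log scores stay finite (a few thousand in magnitude), whereas the equivalent product of probabilities would underflow to 0.0 for every class and make the comparison meaningless.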
More Comments
- A machine cannot know what is unknown
- training examples must be representative
- The estimate of the prior probability of target values
is very important
- A 50% estimate against the real 4.2% distribution is
undesirable: precision is 25%
- 33% estimate: achieves 100%; over 50 irrelevant
cases, pos neg < 3
- Cluster special attributes, like phone number,
price, etc. (similar thinking to our ontology)
- Distributional clustering
- should work fine because of the low noise level in
semi-structured documents
David Marble
- CS 652 Spring 2002
- Project 2: VSM/Bayes
Results (My Test Data)
- VSM
- 8/10 10/10
- 80% 100%
- Failed on: classified ads, car ads with a lot of
info.
- Bayes
- 9/10 10/10
- 90% 100%
- Failed on: missed one restaurant page! That page
had no food description and had city names from
outside my training set. FoodType was the key.
(Not too many extraneous documents have the words
mexican, fish, BBQ, chinese, etc. These words
show up on average just over 2 times per record
in the positive training documents.)
Results (BYU Data)
- VSM
- 20/24 24/24
- 83% 100%
- Failed on: Cars, Apartments, Shopping and Real
Estate. Lots of phone #s, addresses, cities and
states; a name is a given (how can you
distinguish what a restaurant name is?)
- Bayes
- 24/24 24/24
- 100% 100%
- Failed on: nothing. Once again, FoodType was the
key. Luckily, the one applicable document had
a food type listed.
- Training data contained State and Zip only half the
time.
- Names of restaurants could not be a specific
term, therefore just about every record had a
restaurant name.
- Mainly did well with Naïve Bayes because of
FoodType extraction: an average of over 2 per
record in the training data, covering most of the
possible food terms.
VSM and Bayes search results
My Test Data
- 5 positive, 6 negative (including obituaries)
- VSM
- Using 83% threshold
- Precision: 4/5 = 80%
- Recall: 4/5 = 80%
- Using 80% threshold (accepts one training doc
incorrectly)
- Precision: 5/6 = 83.3%
- Recall: 5/5 = 100%
- Bayes
- Precision: 5/5 = 100%
- Recall: 5/5 = 100%
TA Test Data
- 1 positive, 23 negative (including obituaries)
- VSM
- Precision: 1/2 = 50%
- Recall: 1/1 = 100%
- Bayes
- Precision: 1/1 = 100%
- Recall: 1/1 = 100%
- Obituaries vs. genealogy data?
- Rejected by Bayes, but obituary examples in the
training set could affect that
- Changes VSM to 100% precision and recall for both
test sets at the 80% threshold (although one training
doc is still accepted incorrectly)
- Incomplete lexicons
- High variance (Gender: 0.7% to 100%, Place: 0%
to 84.3% in training documents)
- Zero vector undefined in VSM
Craig Parker
My Results
- VSM
- Cut-off value: 0.85
- 100% correct
- Bayesian
- Classified everything as a non-drug
DEG Results
- VSM
- 100% correct using a predetermined cut-off value of
0.85 (I think)
- Bayesian
- Identified everything as negative (although the
margin was smaller on drugs than on non-drugs)
- VSM worked very well for drugs.
- Would have been even better with a cleaner
dictionary of drug names.
- Dose and Form were the most important
distinguishers
- Something is wrong with my Bayesian calculations
Project 2 - Radio Controlled Cars
Results - My Tests
- Digital Camera was always positive and even outscored
RC Car ads on VSM (lots of matches on battery
and charger)
- Both algorithms had trouble with very unrelated
documents (docs where almost no term matches were
found)
- Naïve Bayes had the most trouble when the test set wasn't
similar to RCCars or any of the documents used in
the training set
- Combining VSM with NB using a logical AND was
very successful
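The logical AND combination mentioned above amounts to accepting a document only when both classifiers accept it. The sketch below assumes VSM produces a similarity compared against a threshold and NB produces one score per class; the function and parameter names are hypothetical.

```python
def combined_is_relevant(vsm_similarity, vsm_threshold, nb_score_pos, nb_score_neg):
    """Accept a document only if both VSM and NB accept it.

    Assumptions: VSM accepts when the similarity reaches the threshold,
    and NB accepts when the positive-class score exceeds the negative one.
    """
    vsm_accepts = vsm_similarity >= vsm_threshold
    nb_accepts = nb_score_pos > nb_score_neg
    return vsm_accepts and nb_accepts
```

Under this rule a document that scores high on VSM only because of shared terms (as the digital camera ads did with battery and charger) is still rejected when NB votes negative, at the cost of losing any relevant document that either classifier misses.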
VSM results
- Weight(i) = n+(i) / N+ - n-(i) / N-
- Weight(i) = n+(i) / N+
- Threshold (from avg(sim(+)) and avg(sim(-))): 0.61
Traditional VSM vs. Onto. VSM
- Consider not only attributes, but also values
- Achieve keyword clustering
- Find a way that can automatically and efficiently
define the query words
Naïve Bayes Results
- Bayes
- Requires a relatively large training set,
especially for the (-) set
- Requires a good distribution of the training set
- Improvement
- Eliminate stopwords (obtained from
http://www.oac.cdlib.org/help/stopwords.html)
- Ignore case
- Both work fine
- Naïve Bayes: pickier about the training set, but does not
depend on pre-defined keywords or the ontology
- VSM: application dependent, performs better,
provides a relevance ranking
Finding documents about campgrounds
Results for My Test Set
- VSM
- Precision: 100%
- Recall: 100%
- F-measure: 100%
- Classification threshold value: 0.660
- Naïve Bayes
- Precision: 86%
- Recall: 100%
- F-measure: 92%
Results for Class Test Set
- VSM
- Precision: 20% (1/5)
- Recall: 100% (1/1)
- F-measure: 33%
- Naïve Bayes
- Precision: 20% (1/5)
- Recall: 100% (1/1)
- F-measure: 33%
- Numerical precision in NB calculations: the product of many
small probabilities becomes zero
- NB: accuracy is affected by the number and percentage of
tokens found in the vocabulary
- VSM: accuracy is strongly affected by how similarly
the different documents support the ontology
- VSM: choosing a higher threshold (0.730) would
have given F = 75% for my test set and F = 66%
for the class test set
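The precision, recall and F-measure figures on these slides are consistent with the balanced F-measure (the harmonic mean of precision and recall). A minimal sketch, with hypothetical function and parameter names:

```python
def precision_recall_f(relevant_retrieved, retrieved, relevant):
    """Return (precision, recall, f_measure) as fractions in [0, 1].

    relevant_retrieved: relevant documents the classifier accepted
    retrieved:          all documents the classifier accepted
    relevant:           all relevant documents in the test set
    """
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    recall = relevant_retrieved / relevant if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Class test set from the slide: 1 relevant document, 5 accepted, 1 of them relevant.
p, r, f = precision_recall_f(1, 5, 1)
print(f"{p:.0%} {r:.0%} {f:.0%}")  # prints "20% 100% 33%"
```

The same formula reproduces the Naïve Bayes figure for the presenter's own test set: a precision of 86% with a recall of 100% gives an F-measure of about 92%.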
Text Classification - CS652 Project 2
Vector Space Model
- Query Vector (based on 34 records)
- constructed from a document with 34 records
- Brand (1.0), Model (1.0), CCDResolution (1.0),
ImageResolution (0.65), OpticalZoom (1.0),
DigitalZoom (0.88)
- Threshold: 0.92
- Obtained by computing the similarities of two
relevant documents (0.99, 0.98) and two similar
documents (0.74, 0.83) to the query
- Document Vectors
- Self-collected
- 5 positive (> 0.97) and 5 negative (< 0.89)
- Recall 100% and Precision 100%
- TA-provided
- Positive (0.99), Negative (< 0.58)
- Recall 100% and Precision 100%
Naïve Bayes Classifier
- Training Set
- 20 positive
- 28 negative (20 of them very similar)
- Testing Set
- Self-collected
- 10 positive
- 15 negative (10 of them very similar)
- Ra = 8, R = 10, A = 12
- Recall 80%, Precision 66%
- TA-provided
- Ra = 1, R = 1, A = 1
- Recall 100%, Precision 100%
- The VSM model results in high recall and precision if
and only if onto demo can extract the desired values
correctly
- The original Naïve Bayes Classifier has trouble
classifying some pages in special cases and needs
to be fine-tuned in some ways (stop words,
positive word density, etc.)