Project 2 - PowerPoint PPT Presentation


About This Presentation

Title:

Project 2

Slides: 52

Provided by: LiXu8

Transcript and Presenter's Notes

Title: Project 2

1
Project 2

CS652

2
Project2

Presented by
REEMA AL-KAMHA

3
Results

VSM model
The training set contains 18 documents (10
positive, 8 negative).
Ontovector (1, 0.91, 1, 0.85, 0.90, 1.16,
0.33) corresponds to (Type, GoldAlloy, Price,
Diamondweight, MetalKind, Jem, JemShap)
Threshold = 0.7
Testing set contains 20 documents (10 positive,
10 negative)
Recall = 1, Precision = 0.8
Instructor's testing set: 24 documents
(1 positive, 23 negative)
Recall = 1, Precision = 1

4
Results

NB Model
The training set contains 18 documents (10
positive, 8 negative).
Testing set contains 20 documents (10 positive,
10 negative)
Recall = 1, Precision = 0.7
Instructor's testing set: 24 documents
(1 positive, 23 negative)
Recall = 1, Precision = 0.5

5
Comments

VSM Model
The average of each attribute = the number of
occurrences of the attribute / the number of
records.
NB Model
For the vocabulary, remove all stop-words from
the documents.
In the results I always have Recall = 1, which
means the process does not discard any relevant
document.
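
The per-record average described for the VSM model can be sketched as follows. This is a minimal illustration, not the presenter's code: the function name `attribute_averages` and the counts in the example are hypothetical.

```python
def attribute_averages(counts, num_records):
    """Return occurrences-per-record for each attribute.

    counts: dict mapping attribute name -> number of occurrences found
    num_records: number of records in the training documents
    """
    if num_records <= 0:
        raise ValueError("num_records must be positive")
    return {name: count / num_records for name, count in counts.items()}


# Hypothetical counts for training documents holding 20 records in total.
counts = {"Type": 20, "Price": 20, "GoldAlloy": 18}
averages = attribute_averages(counts, 20)
print(averages)
```

An average above 1 (such as the 1.16 in the ontovector on slide 3) simply means the attribute occurs more than once per record on average.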

6
Documents classification
Muhammed

Two methods have been implemented in Java:
- VSM: Vector Space Model.
- NB: Naive Bayes.
Applying VSM to my domain (Books) was not without
problems. The problems arise mainly from the
meaning of the title and the author. For example,
when applying VSM to car documents, something has
to be decided: do we consider the model of the
car as the title and the make as the author? Such
an assumption caused trouble, since some
irrelevant documents became relevant.

7
So the philosophy was to ignore the title and
author and use the other attributes to judge
whether the document is relevant or not. You can
see from the table that the car document almost
attains the threshold.
Recall: 100%
Precision: 100%
Threshold: 76%
Note: when we take the title and the author into
consideration, the threshold becomes 0.999.
8
The threshold for Books domain

Similarity of other documents
Drug: 0.435
Real_estate: 0.599
Computer: 0.423

9
Naïve Bayes
10
Conclusion

Both of the implemented methods are efficient.
VSM is easier to implement and faster.
Much time was spent because I misunderstood the
NB algorithm; this was my problem.
When amplifying some key attributes that are
almost unique to a domain, 100% precision and
recall are very possible.
NB is not very sensitive to the boundary values.

11
Tim Chartrand Project 2 Results

Application domain
Software (Shareware and Freeware)
Size of training set
Positive: 10
Negative: 10

12
VSM Improvements

Normalize positive training example results to
find the per-record expected values
Add a weight to each attribute
Epos(i) = expected value for attribute i in
positive examples
Eneg(i) = expected value for attribute i in
negative examples
Diff(i) = Epos(i) - Eneg(i)
Weight(i) = Diff(i) / max_{j=1..n} Diff(j)
Ontology(i) = Epos(i) * Weight(i)
Weighting results
Average difference improved from 0.587 to 0.714
Separation improved from 0.280 to 0.422
Price was given a weight of 0, so it is not
considered in document classification
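
The weighting scheme on this slide can be sketched in a few lines. This is an illustrative sketch under two assumptions that the slide does not state: the ontology entry is taken to be the product of Epos(i) and Weight(i), and negative differences are clamped to zero so that no attribute gets a negative weight. The function name and the sample values are hypothetical.

```python
def ontology_vector(e_pos, e_neg):
    """Return (ontology, weight) lists built from per-attribute expected values."""
    # Diff(i) = Epos(i) - Eneg(i)
    diff = [p - n for p, n in zip(e_pos, e_neg)]
    # Assumption: an attribute that is more common in negative examples
    # is simply ignored (weight 0) rather than given a negative weight.
    diff = [max(d, 0.0) for d in diff]
    max_diff = max(diff)
    if max_diff == 0.0:
        raise ValueError("no attribute separates positive from negative examples")
    # Weight(i) = Diff(i) / max_j Diff(j)
    weight = [d / max_diff for d in diff]
    # Assumed combination: Ontology(i) = Epos(i) * Weight(i)
    ontology = [p * w for p, w in zip(e_pos, weight)]
    return ontology, weight


# Hypothetical expected values for three attributes.
e_pos = [1.0, 0.8, 0.5]
e_neg = [0.2, 0.4, 0.5]
ontology, weight = ontology_vector(e_pos, e_neg)
print(weight)
print(ontology)
```

In this example the third attribute appears equally often in positive and negative examples, so it receives weight 0, which mirrors how Price dropped out of the classification on the slide.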

13
Bayes Results Improvements

Improvements: reduce vocabulary to best words
Eliminate stopwords
Stem common prefixes and suffixes
Ignore case
Eliminate numbers
Remove non-alphabetic characters before and after
a word

Somewhat artificial result. I started out at
about 10% precision and added negative training
examples until I correctly classified all test
examples.
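
The vocabulary reductions listed above could look roughly like this. It is a sketch, not the presenter's implementation: the stopword set and suffix list are tiny hypothetical placeholders, and only suffixes are stripped (the slide also mentions prefixes).

```python
import re

# Hypothetical, deliberately tiny lists for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}
SUFFIXES = ("ing", "ed", "es", "s")


def normalize(token):
    """Return the cleaned token, or None if it should be dropped."""
    token = token.lower()                              # ignore case
    token = re.sub(r"^[^a-z]+|[^a-z]+$", "", token)    # trim non-alphabetic edges
    if not token or token in STOPWORDS:                # pure numbers become empty
        return None
    for suffix in SUFFIXES:                            # crude suffix stemming
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vocabulary(text):
    """Build the reduced vocabulary (a set of terms) for one document."""
    terms = (normalize(token) for token in text.split())
    return {term for term in terms if term is not None}


print(vocabulary("The Downloading of 2002 (Shareware) programs!"))
```

A real stemmer (for example a Porter-style one) would handle far more cases; the point here is only the order of the steps.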
14
VSM Vs. Bayes

End results were the same, but:
VSM performed better using only the original
dataset
Bayes seems to need more training data (mainly
negative)
Major advantage of VSM: clustering
Using the ontology as a vector allowed effective
clustering of similar data items (e.g. dates,
prices, etc.)
Reduced dimensionality from about 1500 to 8

15
Text Classification

Helen Chen
CS652 Project 2
May 31, 2002

16
Documents and Methods

Application: movie
Documents
Training Set
Positive docs: 5, negative docs: 24
My Test Set
Positive docs: 5, negative docs: 14
Methods: VSM model and NB

17
Vector Space Model and Naïve Bayes

VSM threshold is 0.65

Results tested on my own testing set and the
instructor-provided testing set
(23 negative docs, 1 positive doc) for the VSM
model (left) and NB (right)

18
Comments on VSM

Weighting is critical to performance
Assign weight according to positive examples
Adjust weight according to negative examples

Weights assigned on each attribute
19
Comments on NB

The choice of irrelevant documents in training
set is critical to the performance

Results for Clustered training set
Results for evenly distributed training set
20
Yihong's Project 2

Target topic: Apartment Rental
Training sets
5 positive, 10 negative
Testing sets
Self sets: 5 positive, 9 negative
TA sets: 1 positive, 23 negative

21
VSM Results

100% precision and recall for both self-collected
sets and TA-collected sets
Threshold value: 0.868
Most similar application
Real Estate, range 0.792-0.846
Compare with AptRental, range 0.891-0.937
Weighting attributes
Precision-weighted
Recall-weighted
F-measure-weighted

22
Naïve Bayes Results

100% precision and recall for both self-collected
sets and TA-collected sets
Summation instead of multiplication
to avoid the problem of underflow
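
The underflow problem and the summation fix can be illustrated as follows. The sketch assumes the summation is over logarithms of the probabilities, which is the usual way to turn the Naive Bayes product into a sum; the slide does not spell this out. The model layout, the probabilities, and the `unseen_prob` fallback are hypothetical.

```python
import math


def naive_product(tokens, prior, word_probs, unseen_prob=1e-6):
    """Direct product P(class) * prod P(word | class); underflows on long inputs."""
    score = prior
    for token in tokens:
        score *= word_probs.get(token, unseen_prob)
    return score


def log_score(tokens, prior, word_probs, unseen_prob=1e-6):
    """log P(class) + sum of log P(word | class); stays finite."""
    score = math.log(prior)
    for token in tokens:
        score += math.log(word_probs.get(token, unseen_prob))
    return score


def classify(tokens, model):
    """Pick the label with the highest log score. model: label -> (prior, word_probs)."""
    return max(model, key=lambda label: log_score(tokens, *model[label]))


# Hypothetical model and a long document of 400 tokens.
model = {
    "pos": (0.5, {"rent": 0.05, "bedroom": 0.04}),
    "neg": (0.5, {"rent": 0.005, "bedroom": 0.001}),
}
tokens = ["rent", "bedroom"] * 200

print(naive_product(tokens, *model["pos"]))  # collapses to 0.0
print(classify(tokens, model))               # still separates the classes
```

Because the logarithm is monotonic, comparing sums of logs picks the same class as comparing the products would, had they not underflowed.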

23
More Comments

The machine cannot know what is unknown:
training examples must be representative
The estimate of the prior probability of target
values is very important
A 50% estimate against a 4.2% real distribution
is undesirable; precision is 25%
With a 33% estimate, 100% is achieved; over 50%
of irrelevant cases: pos neg < 3
Cluster special attributes, like phone number,
price, etc. (similar thinking to our ontology)
Distributional clustering
should work fine because of the low noise level
in semi-structured documents

24
David Marble

CS 652 Spring 2002
Project 2 VSM/Bayes

25
Results (My Test Data)

VSM
Recall: 8/10 (80%)
Precision: 10/10 (100%)
Failed on: classified ads, car ads with a lot of
info.
Bayes
Recall: 9/10 (90%)
Precision: 10/10 (100%)
Failed on: missed one restaurant page! That page
had no food description and city names from
outside my training set. FoodType was the key.
(Not too many extraneous documents have the words
mexican, fish, BBQ, chinese, etc. These words
show up on average just over 2 times per record
in the positive training documents.)

26
Results (BYU Data)

VSM
Recall: 20/24 (83%)
Precision: 24/24 (100%)
Failed on: Cars, Apartments, Shopping and Real
Estate. Lots of phone #s, addresses, cities and
states; a name is a given (how can you
distinguish what a restaurant name is?)
Bayes
Recall: 24/24 (100%)
Precision: 24/24 (100%)
Failed on: nothing. Once again, FoodType was the
key. Luckily, the one applicable document had
food type listed.

27
Comments

Training data contained State and Zip only half
the time.
Names of restaurants could not be restricted to
specific terms, therefore just about every record
had a restaurant name.
Mainly did well with Naïve Bayes because of
FoodType extraction: an average of over 2 per
record in the training data, covering most of the
possible food terms.

28
VSM and Bayes search results

Lars Olson

29
My Test Data

5 positive, 6 negative (including obituaries)
VSM
Using 83% threshold
Precision: 4/5 = 80%
Recall: 4/5 = 80%
Using 80% threshold (accepts one training doc
incorrectly)
Precision: 5/6 = 83.3%
Recall: 5/5 = 100%
Bayes
Precision: 5/5 = 100%
Recall: 5/5 = 100%

30
TA Test Data

1 positive, 23 negative (including obituaries)
VSM
Precision: 1/2 = 50%
Recall: 1/1 = 100%
Bayes
Precision: 1/1 = 100%
Recall: 1/1 = 100%

31
Comments

Obituaries vs. genealogy data?
Rejected by Bayes, but obituary examples in the
training set could affect that
Changes VSM to 100% precision and recall for both
test sets at 80% threshold (although one training
doc is still accepted incorrectly)
Incomplete lexicons
High variance (Gender 0.7% to 100%, Place 0%
to 84.3% in training documents)
Zero vector is undefined in VSM
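
The zero-vector issue is easy to see if the VSM similarity is cosine similarity, which is assumed here since the slides do not name the measure. A document that matches no attribute has length zero, so the normalization divides by zero. The guard below, returning 0.0, is one possible convention rather than the presenter's choice.

```python
import math


def cosine_similarity(doc, query):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc)) * math.sqrt(sum(q * q for q in query))
    if norm == 0.0:
        # Assumed convention: a document with no matched attributes
        # is treated as having no similarity to the ontology vector.
        return 0.0
    return dot / norm


print(cosine_similarity([0, 0, 0], [1.0, 0.5, 0.2]))
print(cosine_similarity([1, 2], [2, 4]))
```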

32
Craig Parker
33
My Results

VSM
Cut-off value: 0.85
100% correct
Bayesian
Classified everything as a non-drug

34
DEG Results

VSM
100% correct using predetermined cut-off value of
0.85 (I think)
Bayesian
Identified everything as negative (although the
margin was smaller on drugs than on non-drugs)

35
Comments

VSM worked very well for drugs.
Would have been even better with a cleaner
dictionary of drug names.
Dose and Form were the most important
distinguishers
Something is wrong with my Bayesian calculations

36
Project 2 - Radio Controlled Cars

Jeff Roth

37
Results - My Tests
39
Comments

Digital Camera always positive, even outscored
RC Car ads on VSM: lots of matches on battery
and charger
Both algorithms had trouble with very unrelated
documents, i.e. docs where almost no term matches
were found
Naïve Bayes had the most trouble when the test
document wasn't similar to RCCars or any of the
documents used in the training set
Combining VSM with NB using a logical AND was
very successful
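
Combining the two classifiers with a logical AND could be as simple as the sketch below. The function name, the label string, and the numbers in the example are hypothetical; the slides give no thresholds for this project.

```python
def combined_relevant(vsm_similarity, vsm_threshold, nb_label):
    """Accept a document only if both classifiers call it relevant."""
    vsm_says_yes = vsm_similarity >= vsm_threshold
    nb_says_yes = nb_label == "relevant"
    return vsm_says_yes and nb_says_yes


# Hypothetical case: a page scores high on VSM (many battery/charger
# matches) but Naive Bayes rejects it, so the combination rejects it.
print(combined_relevant(0.91, 0.80, "irrelevant"))
print(combined_relevant(0.88, 0.80, "relevant"))
```

The AND trades recall for precision: a document is lost if either classifier misses it, but a false positive must fool both.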

40
VSM results

Weight(i) = n+(i) / N+ - n-(i) / N-

Weight(i) = n+(i) / N+
Threshold (from avg(sim(+)) and avg(sim(-))): 0.61

41
Traditional VSM vs. Onto. VSM

Consider not only attributes, but values
Achieve keyword clustering
Find a way that can automatically and efficiently
define the query words

42
Naïve Bayes Results

Bayes
Requires a relatively large training set,
especially for the (-) set
Requires a good distribution of the training set

Improvement
Eliminate stopwords (obtained from
http://www.oac.cdlib.org/help/stopwords.html)
Ignore case

43
Conclusion

Both work fine
Naïve Bayes: pickier about the training set, but
does not depend on pre-defined keywords or the
ontology
VSM: application dependent, performs better,
provides a relevance rank

44
Finding documents about campgrounds

Alan Wessman

45
Results for My Test Set

VSM
Precision: 100%
Recall: 100%
F-measure: 100%
Classification threshold value: 0.660

Naïve Bayes
Precision: 86%
Recall: 100%
F-measure: 92%

46
Results for Class Test Set

VSM
Precision: 20% (1/5)
Recall: 100% (1/1)
F-measure: 33%

Naïve Bayes
Precision: 20% (1/5)
Recall: 100% (1/1)
F-measure: 33%
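
The figures on this slide follow from the standard definitions, assuming the F-measure is the balanced harmonic mean of precision and recall (the slides do not say which variant was used, but the numbers agree with it). The function and parameter names are illustrative.

```python
def precision_recall_f(relevant_retrieved, retrieved, relevant):
    """Return (precision, recall, F-measure) from raw counts."""
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    recall = relevant_retrieved / relevant if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Class test set: 5 documents accepted, 1 of them relevant, 1 relevant overall.
p, r, f = precision_recall_f(relevant_retrieved=1, retrieved=5, relevant=1)
print(round(p * 100), round(r * 100), round(f * 100))
```

With one positive among 24 documents, four false positives are enough to pull precision down to 20% and F to 33% even at perfect recall.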

47
Observations

Calculation precision in NB: the product of many
small probabilities becomes zero
NB: accuracy is affected by the number and
percentage of tokens found in the vocabulary
VSM: accuracy is strongly affected by how
similarly the different documents support the
ontology
VSM: choosing a higher threshold (0.730) would
have given F = 75% for my test set and F = 66%
for the class test set

48
Text Classification: CS652 Project 2

Yuanqiu (Joe) Zhou

49
Vector Space Model

Query vector (based on 34 records)
constructed from a document with 34 records
Brand (1.0), Model (1.0), CCDResolution (1.0),
ImageResolution (0.65), OpticalZoom (1.0),
DigitalZoom (0.88)
Threshold: 0.92
Obtained by computing the similarities of two
relevant documents (0.99, 0.98) and two similar
documents (0.74, 0.83) to the query
Document vectors
Self-collected
5 positive (> 0.97) and 5 negative (< 0.89)
Recall 100% and precision 100%
TA-provided
Positive (0.99), negative (< 0.58)
Recall 100% and precision 100%

50
Naïve Bayes Classifier

Training Set
20 positive
28 negative (20 of them very similar)
Testing Set
Self-collected
10 positive
15 negative (10 of them very similar)
Ra = 8, R = 10, A = 12
Recall 80%, precision 66%
TA-provided
Ra = 1, R = 1, A = 1
Recall 100%, precision 100%

51
Comments

The VSM model results in high recall and
precision if and only if the onto demo can
extract the desired values correctly
The original Naïve Bayes classifier has trouble
classifying some pages in special cases and needs
to be fine-tuned in some ways (stop words,
positive word density, etc.)
