Transcript and Presenter's Notes

Title: Project 2

1
Project 2
  • CS652

2
Project 2
  • Presented by
  • REEMA AL-KAMHA

3
Results
  • VSM model
  • The training set contains 18 documents (10 positive, 8 negative).
  • Onto vector (1, 0.91, 1, 0.85, 0.90, 1.16, 0.33) corresponds to (Type, GoldAlloy, Price, Diamondweight, MetalKind, Jem, JemShap)
  • Threshold = 0.7 (a similarity sketch follows below)
  • The testing set contains 20 documents (10 positive, 10 negative)
  • Recall = 1, Precision = 0.8
  • The instructor's testing set contains 24 documents (1 positive, 23 negative)
  • Recall = 1, Precision = 1
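The slides do not spell out the similarity computation. Below is a minimal sketch in Java (the language named elsewhere in this deck) of the standard cosine similarity between the onto vector above and a document's attribute vector; the class name and the sample document vector are illustrative assumptions, not project code.

    // Cosine similarity between the ontology vector and a document vector;
    // a document is classified as relevant when sim >= threshold (0.7 here).
    public class VsmSimilarity {
        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na  += a[i] * a[i];
                nb  += b[i] * b[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            // Onto vector from this slide: (Type, GoldAlloy, Price, ...)
            double[] onto = {1, 0.91, 1, 0.85, 0.90, 1.16, 0.33};
            double[] doc  = {1, 1, 1, 1, 0, 1, 0};  // hypothetical attribute counts
            System.out.println(cosine(onto, doc) >= 0.7 ? "relevant" : "irrelevant");
        }
    }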

4
Results
  • NB Model
  • The training set contains 18 documents (10
    positive, 8 negative) .
  • Testing set contains 20 documents (10 positive,
    10 negative)
  • Recall1 , Precision0.7
  • Testing set for instructor 24 documents
    (1positive, 23 negative)
  • Recall1 , Precision0.5

5
Comments
  • VSM Model
  • The average for each attribute = the number of occurrences of the attribute / the number of records.
  • NB Model
  • For the vocabulary document, remove all stop words.
  • In the results I always have Recall = 1, which means the process does not discard any relevant documents.

6
Document classification
Muhammed
  • Two methods have been implemented in Java:
  • - VSM: Vector Space Model
  • - NB: Naive Bayes
  • Applying VSM to my domain (Books) was not without problems, basically because of the meaning of the title and the author. For example, when trying to apply VSM to cars, some things need to be figured out, such as: do we consider the model of the car as the title and the make as the author? Of course, such assumptions caused some trouble, since some irrelevant documents became relevant.

7
So the philosophy was to ignore the title and author and use the other attributes to judge whether the document is relevant or not. You can see from the table that the car domain comes close to attaining the threshold. (Note: when we take the title and the author into consideration, the threshold becomes 0.999.)
Recall = 100%, Precision = 100%, Threshold = 76%
8
The threshold for the Books domain
  • Similarity of other documents:
  • Drug: 0.435
  • Real_estate: 0.599
  • Computer: 0.423

9
Naïve Bayes
10
Conclusion
  • Both of the implemented methods are effective.
  • VSM is easier to implement and faster.
  • Much time was spent because I misunderstood the NB algorithm; this was my problem.
  • When amplifying some key attributes that are almost unique to a domain, 100% precision and recall are very possible.
  • NB is not very sensitive to the boundary values.

11
Tim Chartrand Project 2 Results
  • Application domain: Software (Shareware and Freeware)
  • Size of training set:
  • Positive: 10
  • Negative: 10

12
VSM Improvements
  • Normalize positive training example results to find the per-record expected values
  • Add a weight to each attribute (sketched in code below):
  • Epos(i) = expected value for attribute i in positive examples
  • Eneg(i) = expected value for attribute i in negative examples
  • Diff(i) = Epos(i) - Eneg(i)
  • Weight(i) = Diff(i) / max_{j=1..n} Diff(j)
  • Ontology(i) = Epos(i) * Weight(i)
  • Weighting results:
  • Average difference improved from 0.587 to 0.714
  • Separation improved from 0.280 to 0.422
  • Price was given a weight of 0, so it is not considered in document classification
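A compact Java sketch of the weighting scheme above. The "=" signs are restored from context; the multiplication in the Ontology(i) step is an assumption, since that operator was lost in extraction.

    // Build the weighted ontology vector from per-attribute expected values.
    // Assumes at least one attribute has a positive Diff.
    public class AttributeWeights {
        static double[] ontologyVector(double[] epos, double[] eneg) {
            int n = epos.length;
            double[] diff = new double[n];
            double maxDiff = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < n; i++) {
                diff[i] = epos[i] - eneg[i];          // Diff(i) = Epos(i) - Eneg(i)
                maxDiff = Math.max(maxDiff, diff[i]);
            }
            double[] onto = new double[n];
            for (int i = 0; i < n; i++) {
                double weight = diff[i] / maxDiff;    // Weight(i) = Diff(i) / max_j Diff(j)
                onto[i] = epos[i] * weight;           // Ontology(i) = Epos(i) * Weight(i)  (assumed product)
            }
            return onto;
        }
    }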

13
Bayes Results Improvements
  • Improvements: reduce vocabulary to the best words (a sketch of these steps follows this list)
  • Eliminate stopwords
  • Stem common prefixes and suffixes
  • Ignore case
  • Eliminate numbers
  • Remove non-alphabetic characters before and after a word
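A rough Java sketch of the vocabulary-reduction steps above. The stop-word list and the suffix-stripping rule are illustrative stand-ins, not the ones actually used.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class VocabularyFilter {
        static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to", "in"));

        // Returns the normalized vocabulary word, or null if the token is dropped.
        static String normalize(String token) {
            // Remove non-alphabetic characters before and after the word.
            String w = token.replaceAll("^[^A-Za-z]+|[^A-Za-z]+$", "");
            w = w.toLowerCase();                        // ignore case
            if (w.isEmpty()) return null;               // eliminates pure numbers
            if (STOP_WORDS.contains(w)) return null;    // eliminate stopwords
            if (w.endsWith("ing") && w.length() > 5)    // crude stand-in for stemming
                w = w.substring(0, w.length() - 3);
            return w;
        }
    }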

Somewhat artificial result: I started out at about 10% precision and added negative training examples until I correctly classified all test examples.
14
VSM vs. Bayes
  • End results were the same, but:
  • VSM performed better using only the original dataset
  • Bayes seems to need more training data (mainly negative)
  • Major advantage of VSM: clustering
  • Using the ontology as a vector allowed effective clustering of similar data items (e.g., dates, prices, etc.)
  • Reduced dimensionality from about 1500 to 8

15
Text Classification
  • Helen Chen
  • CS652 Project 2
  • May 31, 2002

16
Documents and Methods
  • Application: movie
  • Documents:
  • Training set:
  • Positive docs: 5, Negative docs: 24
  • My test set:
  • Positive docs: 5, Negative docs: 14
  • Methods: VSM model and NB

17
Vector Space Model and Naïve Bayes
  • VSM threshold is 0.65
  • NB
  • Results tested on my own testing set and the instructor-provided testing set
  • (23 negative docs, 1 positive doc) for the VSM model (left) and NB (right)

18
Comments on VSM
  • Weighting is critical to performance
  • Assign weight according to positive examples
  • Adjust weight according to negative examples

Weights assigned to each attribute
19
Comments on NB
  • The choice of irrelevant documents in the training set is critical to performance

Results for Clustered training set
Results for evenly distributed training set
20
Yihong's Project 2
  • Target topic: Apartment Rental
  • Training sets:
  • 5 positive, 10 negative
  • Testing sets:
  • Self sets: 5 positive, 9 negative
  • TA sets: 1 positive, 23 negative

21
VSM Results
  • 100% precision and recall for both self-collected sets and TA-collected sets
  • Threshold value: 0.868
  • Most similar application:
  • Real Estate, range 0.792 to 0.846
  • Compare with AptRental, range 0.891 to 0.937
  • Weighting attributes:
  • Precision-weighted
  • Recall-weighted
  • F-measure-weighted

22
Naïve Bayes Results
  • 100% precision and recall for both self-collected sets and TA-collected sets
  • Summation instead of a product, to avoid the problem of underflow (see the log-space sketch below)
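The "summation instead of a product" point is the standard log-space trick: since log(ab) = log a + log b, summing log-probabilities ranks classes exactly as multiplying raw probabilities does, but cannot underflow. A minimal Java sketch with illustrative names:

    // Score one class for a document by summing log-probabilities.
    // Each p must be smoothed so it is never exactly zero.
    public class LogSpaceNb {
        static double logScore(double classPrior, double[] wordProbs) {
            double score = Math.log(classPrior);
            for (double p : wordProbs) {
                score += Math.log(p);
            }
            return score;   // compare scores across classes; the highest wins
        }
    }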

23
More Comments
  • The machine cannot know what is unknown
  • Training examples must be representative
  • The estimate of the prior probability of target values is very important
  • A 50% estimate against a 4.2% real distribution is undesirable; precision is 25%
  • A 33% estimate achieves 100%, over 50 irrelevant cases (pos : neg < 3)
  • Cluster special attributes, like phone number, price, etc. (similar thinking to our ontology)
  • Distributional clustering
  • should work fine because of the low noise level of semi-structured documents

24
David Marble
  • CS 652 Spring 2002
  • Project 2: VSM/Bayes

25
Results (My Test Data)
  • RECALL / PRECISION
  • VSM
  • Recall 8/10 (80%), Precision 10/10 (100%)
  • Failed on: classified ads and car ads with a lot of info.
  • Bayes
  • Recall 9/10 (90%), Precision 10/10 (100%)
  • Failed on: missed one restaurant page! That page had no food description and city names from outside my training set. FoodType was the key. (Not too many extraneous documents have the words mexican, fish, BBQ, chinese, etc. These words show up on average just over 2 times per record in the positive training documents.)

26
Results (BYU Data)
  • RECALL / PRECISION
  • VSM
  • Recall 20/24 (83%), Precision 24/24 (100%)
  • Failed on: cars, apartments, shopping, and real estate. Lots of phone numbers, addresses, cities, and states; a name is a given (how can you distinguish what a restaurant name is?)
  • Bayes
  • Recall 24/24 (100%), Precision 24/24 (100%)
  • Failed on: nothing. Once again, FoodType was the key. Luckily, the one applicable document had a food type listed.

27
Comments
  • Training data contained State + Zip only half the time.
  • Names of restaurants could not be a specific term; therefore, just about every record had a restaurant name.
  • Mainly did well with Naïve Bayes because the FoodType extraction averaged over 2 per record in the training data and covered most of the possible food terms.

28
VSM and Bayes search results
  • Lars Olson

29
My Test Data
  • 5 positive, 6 negative (including obituaries)
  • VSM
  • Using an 83% threshold:
  • Precision: 4/5 (80%)
  • Recall: 4/5 (80%)
  • Using an 80% threshold (accepts one training doc incorrectly):
  • Precision: 5/6 (83.3%)
  • Recall: 5/5 (100%)
  • Bayes
  • Precision: 5/5 (100%)
  • Recall: 5/5 (100%)

30
TA Test Data
  • 1 positive, 23 negative (including obituaries)
  • VSM
  • Precision: 1/2 (50%)
  • Recall: 1/1 (100%)
  • Bayes
  • Precision: 1/1 (100%)
  • Recall: 1/1 (100%)

31
Comments
  • Obituaries vs. genealogy data?
  • Rejected by Bayes, but obituary examples in the training set could affect that
  • Changes VSM to 100% precision and recall for both test sets at the 80% threshold (although one training doc is still accepted incorrectly)
  • Incomplete lexicons
  • High variance (Gender: 0.7% to 100%, Place: 0% to 84.3% in training documents)
  • The zero vector is undefined in VSM (cosine similarity divides by the vector's norm)

32
Craig Parker
33
My Results
  • VSM
  • Cut-off value: 0.85
  • 100% correct
  • Bayesian
  • Classified everything as a non-drug

34
DEG Results
  • VSM
  • 100% correct using the predetermined cutoff value of 0.85 (I think)
  • Bayesian
  • Identified everything as negative (although the margin was smaller on drugs than on non-drugs)

35
Comments
  • VSM worked very well for drugs.
  • It would have been even better with a cleaner dictionary of drug names.
  • Dose and Form were the most important distinguishers.
  • Something is wrong with my Bayesian calculations.

36
Project 2 - Radio Controlled Cars
  • Jeff Roth

37
Results - My Tests
38
(No Transcript)
39
Comments
  • Digital Camera was always positive, and even outscored RC Car ads on VSM - lots of matches on battery and charger
  • Both algorithms had trouble with very unrelated documents - docs where almost no term matches were found
  • Naïve Bayes had the most trouble when the test set wasn't similar to RC Cars or any of the documents used in the training set
  • Combining VSM with NB using a logical AND was very successful

40
VSM results
  • Weight(i) = n+(i) / N+ - n-(i) / N-
  • Weight(i) = n+(i) / N+
  • Threshold = (avg(sim(+)) + avg(sim(-))) / 2 = 0.61 (sketched below)
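A small Java sketch of one plausible reading of these formulas: n+(i) and n-(i) as attribute-i counts over the positive and negative set sizes N+ and N-, and the threshold as the midpoint of the average positive and negative similarities. The stripped "+" characters are reconstructed, so treat this reading as an assumption.

    public class VsmWeighting {
        // Weight(i) = n+(i)/N+ - n-(i)/N-
        static double weight(double nPos, double totalPos, double nNeg, double totalNeg) {
            return nPos / totalPos - nNeg / totalNeg;
        }

        // Threshold = (avg(sim(+)) + avg(sim(-))) / 2, reported as 0.61 here
        static double threshold(double[] simPos, double[] simNeg) {
            return (avg(simPos) + avg(simNeg)) / 2;
        }

        static double avg(double[] xs) {
            double sum = 0;
            for (double x : xs) sum += x;
            return sum / xs.length;
        }
    }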

41
Traditional VSM vs. Onto. VSM
  • Consider not only attributes, but values
  • Achieve keyword clustering
  • Find a way that can automatically and efficiently
    define the query words

42
Naïve Bayes Results
  • Bayes
  • Requires a relatively large training set, especially for the (-) set
  • Requires a good distribution of the training set
  • Improvements:
  • Eliminate stopwords (obtained from http://www.oac.cdlib.org/help/stopwords.html)
  • Ignore case

43
Conclusion
  • Both work fine
  • Naïve Bayes: pickier about the training set, but does not depend on pre-defined keywords or the ontology
  • VSM: application dependent; performs better and provides a relevance rank

44
Finding documents about campgrounds
  • Alan Wessman

45
Results for My Test Set
  • VSM
  • Precision: 100%
  • Recall: 100%
  • F-measure: 100%
  • Classification threshold value: 0.660
  • Naïve Bayes
  • Precision: 86%
  • Recall: 100%
  • F-measure: 92%

46
Results for Class Test Set
  • VSM
  • Precision: 20% (1/5)
  • Recall: 100% (1/1)
  • F-measure: 33%
  • Naïve Bayes
  • Precision: 20% (1/5)
  • Recall: 100% (1/1)
  • F-measure: 33%
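For reference, the F-measure is the harmonic mean of precision and recall, F = 2PR / (P + R): with P = 0.20 and R = 1.00, F = 2(0.20)(1.00) / 1.20 ≈ 0.33, which matches the 33% reported above.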

47
Observations
  • Calculating precision in NB: the product of many small probabilities becomes zero (underflow)
  • NB: accuracy affected by the number and percentage of tokens found in the vocabulary
  • VSM: accuracy strongly affected by how similarly the different documents support the ontology
  • VSM: choosing a higher threshold (0.730) would have given F = 75% for my test set and F = 66% for the class test set
48
Text Classification: CS652 Project 2
  • Yuanqiu (Joe) Zhou

49
Vector Space Model
  • Query vector (based on 34 records)
  • Constructed from a document with 34 records
  • Brand (1.0), Model (1.0), CCDResolution (1.0), ImageResolution (0.65), OpticalZoom (1.0), DigitalZoom (0.88)
  • Threshold: 0.92
  • Obtained by computing the similarities to the query of two relevant documents (0.99, 0.98) and two similar documents (0.74, 0.83)
  • Document vectors:
  • Self-collected:
  • 5 positive (> 0.97) and 5 negative (< 0.89)
  • Recall: 100% and Precision: 100%
  • TA-provided:
  • Positive (0.99), Negative (< 0.58)
  • Recall: 100% and Precision: 100%

50
Naïve Bayes Classifier
  • Training set:
  • 20 positive
  • 28 negative (20 of them very similar)
  • Testing set:
  • Self-collected:
  • 10 positive
  • 15 negative (10 of them very similar)
  • Ra = 8, R = 10, A = 12
  • Recall = 80%, Precision = 66%
  • TA-provided:
  • Ra = 1, R = 1, A = 1
  • Recall = 100%, Precision = 100%
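Reading the counts above by the usual retrieval notation (an assumption): Ra = relevant documents retrieved, R = total relevant documents, A = total documents retrieved, giving Recall = Ra/R = 8/10 = 80% and Precision = Ra/A = 8/12 ≈ 66%.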

51
Comments
  • The VSM model results in high recall and precision if and only if the onto demo can extract the desired values correctly
  • The original Naïve Bayes classifier has trouble classifying some pages in special cases and needs to be fine-tuned in some ways (stop words, positive word density, etc.)