Mathematical Programming in Support Vector Machines

Transcript and Presenter's Notes

1
We Are Overwhelmed with Data
  • Data collecting and storing are no longer expensive and difficult tasks.
  • Earth-orbiting satellites transmit terabytes of data every day.
  • The New York Stock Exchange processes, on average, 333 million transactions per day.
  • The gap between the generation of data and our understanding of it is growing explosively.

2
It Is Not Enough to Store Raw Data
  • Raw data only record facts and nothing else.
  • We want to learn something from the raw data.
  • We want the hidden information or knowledge in the raw data.
  • The idea of an easily accessible, individually configurable storehouse of knowledge was the beginning of the literature on mechanized information retrieval.

3
What is Data Mining?
(Get the Patterns that You Can't See)
  • Computers make it possible.

4
Data Mining Applications
(Why Does Amazon Know What You Need?)
  • Consumer behavior analysis
  • Market basket analysis
  • Customer relationship management (CRM)
  • Customer loyalty
  • Disease diagnosis and prognosis
  • Drug discovery
  • Bioinformatics
  • Microarray gene expression data analysis
  • Sports: NBA coaches' latest weapon
  • Toronto Raptors

5
Can Computers Read? Text Classification and Web Mining
  • "Publication has been extended far beyond our present ability to make real use of the record."
  • V. Bush, "As We May Think," Atlantic Monthly, 176 (1945), pp. 101-108.

6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
The Beginning of Information Retrieval: Memex
The idea of an easily accessible, individually configurable storehouse of knowledge, the beginning of the literature on mechanized information retrieval.
Vannevar Bush (1945)
10
Digital Information Repositories
  • World Wide Web (>10 billion documents)
  • Digital Libraries (e.g. CDL)
  • Special purpose content providers (e.g. Lexis
    Nexis)
  • Company intranets and digital assets
  • Scientific literature libraries (e.g. citeseer)
  • Medical information portals (e.g. MedlinePlus)
  • Patent databases (e.g. US Patent Office)
  • ...

11
(No Transcript)
12
Information Retrieval
  • Core problem of information retrieval (Maron & Kuhns, 1960)
  • Facets:
  • Query-based search
  • Categorizing and annotating documents
  • Organizing and managing information repositories
  • Assessing the quality and authoritativeness of information sources
  • Understanding a user's information need

Adequately identifying the information content of documentary data
13
Machine Learning & Information Retrieval
  • Information content is not directly observable
  • Relevant entities such as concepts, topics, categories, named entities, user interests, etc. need to be inferred from indirect evidence
  • Observables (raw data): text, annotations, document structure, link structure, user actions, ratings, ...
  • Inference problem: automatically infer information content from the raw data

14
IR vs. RDBMS
  • RDBMS (relational database management system)
  • Semantics of each object are well-defined
  • Complex query languages
  • You get exactly what you ask for
  • Emphasis is on efficiency and the ACID properties (atomicity, consistency, isolation, durability)
  • IR
  • Semantics of each object are ill-defined; people don't agree
  • Usually simple query languages
  • You get what you want, even if you asked for it badly

15
IR vs. RDBMS
  • Merging trends
  • RDBMS → IR
  • Answer fuzzy questions
  • e.g., find important publications on quantum mechanics around 1945
  • IR → RDBMS
  • Information extraction
  • unstructured data → structured data
  • Semi-structured data, e.g., XML files

16
IR Trend
From Jamie Callan's lecture slides
17
Preprocessing for Text Classification: Converting Documents into Data Input
  • Stopwords, such as "a", "an", "the", "be", "on"
  • Eliminating stopwords will reduce space and improve performance
  • Polysemy: "can" can be a verb or a noun
  • "to be or not to be"
  • Stemming or conflation using Porter's algorithm
  • Stemming increases the number of documents in the response, but also the number of irrelevant documents

18
A School of Fish
19
Fundamental Problems in Data Mining
  • Classification problems (supervised learning)
  • Test the classifier on fresh data to evaluate success
  • Classification results are objective
  • Decision trees, neural networks, support vector machines, k-nearest neighbor, Naive Bayes, etc.
  • Linear and nonlinear regression

20
Fundamental Problems in Data Mining
  • Feature selection / dimension reduction
  • Occam's razor: the simplest is the best
  • Too many features can degrade generalization performance (curse of dimensionality)
  • Clustering (unsupervised learning)
  • Clustering results are subjective
  • k-means algorithm and k-medians algorithm
  • Association rules
  • Minimum support and confidence are required

21
Binary Classification Problem: Learn a Classifier from the Training Set
Given a training dataset (a standard way to write it is sketched below).
Main goal: predict the unseen class label for new data.
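The training set itself appears on the slide as a formula image; a common way to write it (an assumption about the notation, not taken from the transcript) is:
\[
S = \{(x^i, y_i)\}_{i=1}^{l}, \qquad x^i \in \mathbb{R}^n, \quad y_i \in \{-1, +1\},
\]
and the classifier is a function \(f:\mathbb{R}^n \to \{-1,+1\}\) whose value on an unseen point \(x\) predicts its class label.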
22
Binary Classification Problem: Linearly Separable Case
(Figure: benign and malignant samples separated by a plane)
23
Support Vector Machines for Classification: Maximizing the Margin between Bounding Planes
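The figure on this slide is not reproduced in the transcript; in the usual linear-SVM notation (assumed here, not quoted from the slide), the two bounding planes and the margin between them are:
\[
x^\top w = \gamma + 1, \qquad x^\top w = \gamma - 1, \qquad \text{margin} = \frac{2}{\|w\|_2},
\]
so maximizing the margin amounts to minimizing \(\|w\|_2\).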
24
Summary of the Notations
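The notation table on this slide is an image; the convention assumed in the sketches below (a common one for this material, stated as an assumption) is:
\[
A \in \mathbb{R}^{m \times n} \ \text{(training points as rows)}, \quad
D = \operatorname{diag}(y_1,\dots,y_m),\ y_i \in \{-1,+1\}, \quad
e = (1,\dots,1)^\top, \quad
w \in \mathbb{R}^n,\ \gamma \in \mathbb{R},
\]
with separating plane \(x^\top w = \gamma\).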
25
Robust Linear Programming
Preliminary Approach to SVM
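The formulation itself is an image in the original slides; a common robust-LP formulation in the style of Bennett and Mangasarian (offered as an assumption about what the slide shows), using the notation above, is:
\[
\min_{w,\gamma,y}\ e^\top y
\quad \text{s.t.} \quad
D(Aw - e\gamma) + y \ge e, \qquad y \ge 0,
\]
which minimizes the total violation \(y\) of the separation constraints.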
26
(No Transcript)
27
Support Vector Machine Formulations
(Two Different Measures of Training Error)
2-Norm Soft Margin
1-Norm Soft Margin
  • The margin is maximized by minimizing the reciprocal of the margin (both formulations are written out below).
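Both formulations appear as images in the original slides; written out under the notation assumed above (so the exact constants may differ from the slides), they are roughly:

1-norm soft margin:
\[
\min_{w,\gamma,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\, e^\top \xi
\quad \text{s.t.} \quad
D(Aw - e\gamma) + \xi \ge e, \quad \xi \ge 0.
\]
2-norm soft margin:
\[
\min_{w,\gamma,\xi}\ \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\,\|\xi\|_2^2
\quad \text{s.t.} \quad
D(Aw - e\gamma) + \xi \ge e.
\]
In both, minimizing \(\|w\|_2\) (the reciprocal of the margin \(2/\|w\|_2\)) maximizes the margin, while the \(\xi\) term penalizes training error in a 1-norm or 2-norm sense.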

28
Tuning Procedure: How to Determine C?
(Figure: tuning diagram over candidate values of C)
The final value of the parameter C is the one with the maximum testing-set correctness (a grid-search sketch follows below).
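A minimal sketch of this tuning procedure, using scikit-learn's SVC and synthetic data as stand-ins for the presentation's own solver and dataset (both are assumptions), with C chosen by cross-validated correctness:

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.model_selection import GridSearchCV
  from sklearn.svm import SVC

  # Synthetic stand-in for the training data
  X, y = make_classification(n_samples=200, n_features=10, random_state=0)

  # Candidate values of C on a logarithmic grid
  param_grid = {"C": [2.0 ** k for k in range(-5, 6)]}

  # 5-fold cross-validation; keep the C with the highest held-out correctness
  search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
  search.fit(X, y)
  print(search.best_params_, search.best_score_)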
29
Two-spiral Dataset (94 White Dots, 94 Red Dots)
30
Support Vector Regression
(Figure: original function vs. estimated function, with the chosen parameters)
MAE over the 49×49 mesh points: 0.0513
Training time: 22.58 sec.
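The regression surfaces and parameters are shown as figures in the original; the underlying epsilon-insensitive support vector regression problem (an assumption about which SVR variant produced these numbers) is:
\[
\min_{w,b}\ \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{l} \max\!\bigl(0,\ |y_i - (w^\top x^i + b)| - \varepsilon\bigr),
\]
where errors smaller than \(\varepsilon\) are ignored and larger ones are penalized linearly.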
31
Naïve Bayes for Classification Problems: Good for Binary as well as Multi-category
  • Let each attribute be a random variable. What is the probability of the class given an instance?
  • Naïve Bayes assumptions (the resulting rule is written out below):
  • The importance of each attribute is equal
  • All attributes are independent!
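Written out, the rule these two assumptions lead to is:
\[
P(C \mid x_1,\dots,x_n) \ \propto\ P(C)\, \prod_{j=1}^{n} P(x_j \mid C),
\]
and the predicted class is the \(C\) that maximizes the right-hand side.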

32
The Weather Data Example
Ian H. Witten & Eibe Frank, Data Mining
33
Probabilities for the Weather Data
Using Frequencies to Approximate Probabilities
34
The Zero-frequency Problem
  • What if an attribute value does NOT occur with a class value?
  • The posterior probability will then be zero, no matter how likely the other attribute values are!

Q: A die is rolled 8 times, showing 2, 5, 6, 2, 1, 5, 3, 6. What is the estimated probability of rolling a 4? (A standard correction is sketched below.)
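A standard remedy (the Laplace, or add-one, correction; given here as the usual textbook fix rather than anything quoted from the slide) adds 1 to every count:
\[
\hat P(v \mid c) = \frac{n_{v,c} + 1}{n_c + k},
\]
where \(n_{v,c}\) is the number of training examples of class \(c\) with attribute value \(v\), \(n_c\) is the number of examples of class \(c\), and \(k\) is the number of possible values. For the die question above, 4 never appears among the 8 rolls, so the raw estimate is \(0/8 = 0\), while the corrected estimate is \((0+1)/(8+6) = 1/14\).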
35
How to Evaluate What's Been Learned? When Cost Is NOT Sensitive
  • Measure the performance of a classifier in terms of error rate or accuracy

Main goal: predict the unseen class label for new data
  • We have to assess a classifier's error rate on a set that plays no role in the learning process
  • Split the data instances at hand into two parts:
  • Training set: for learning the classifier
  • Testing set: for evaluating the classifier

36
k-fold (Stratified) Cross-Validation: Maximize the Use of the Data at Hand
  • Split the data into k approximately equal partitions
  • Each partition in turn is used for testing while the remainder is used for training
  • The labels (+/-) in the training and testing sets should be in about the right proportion
  • Doing the random splitting within the positive class and the negative class separately will guarantee this
  • This procedure is called stratification (a sketch follows below)
  • Leave-one-out cross-validation: k = number of data points
  • No random sampling is involved, but it is non-stratified
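A minimal sketch of stratified 10-fold cross-validation with scikit-learn; the library, the Gaussian Naïve Bayes stand-in classifier, and the synthetic data are all assumptions, not part of the original slides:

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.model_selection import StratifiedKFold
  from sklearn.naive_bayes import GaussianNB

  # Imbalanced synthetic data, so stratification actually matters
  X, y = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=0)

  skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
  accuracies = []
  for train_idx, test_idx in skf.split(X, y):   # each fold keeps the +/- proportion
      clf = GaussianNB().fit(X[train_idx], y[train_idx])
      accuracies.append(clf.score(X[test_idx], y[test_idx]))

  print(np.mean(accuracies))   # average testing-set accuracy over the 10 folds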

37
How to Compare Two Classifiers? Hypothesis Testing: Paired t-test
  • We compare two learning algorithms by comparing the average error rate over several cross-validations (the test statistic is given below)
  • Assume the same cross-validation split can be used for both methods
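With \(d_i\) the difference in error rate between the two methods on fold \(i\) of a \(k\)-fold split, the paired t-statistic is:
\[
t = \frac{\bar d}{s_d / \sqrt{k}}, \qquad
\bar d = \frac{1}{k}\sum_{i=1}^{k} d_i, \qquad
s_d^2 = \frac{1}{k-1}\sum_{i=1}^{k} (d_i - \bar d)^2,
\]
compared against the t-distribution with \(k-1\) degrees of freedom.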

38
How to Evaluate What's Been Learned? When Cost IS Sensitive
  • Two types of error will occur:
  • False Positives (FP) and False Negatives (FN)

39
ROC Curve: Receiver Operating Characteristic Curve
  • An evaluation method for learning models.
  • What it is concerned with is the ranking of instances produced by the learning model.
  • A ranking means that we sort the instances w.r.t. their probability of being a positive instance, from high to low.
  • The ROC curve plots the true positive rate (TPr) as a function of the false positive rate (FPr); both rates are written out below.
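The two rates, written out:
\[
\mathrm{TPr} = \frac{TP}{TP + FN}, \qquad
\mathrm{FPr} = \frac{FP}{FP + TN}.
\]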

40
An Example of ROC Curve
41
Using ROC to Compare Two Methods
Under the same FP rate, method A is better than
method B
42
What if There is a Tie?
Which one is better?
43
Area under the Curve (AUC)
  • An index of the ROC curve, ranging from 0 to 1.
  • An AUC value of 1 corresponds to a perfect ranking (all positive instances are ranked higher than all negative instances).
  • A simple formula for calculating AUC (a rank-based form is sketched below):
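The formula on the slide is an image; a common rank-based (Mann-Whitney) form, offered as an assumption about which formula is meant:
\[
\mathrm{AUC} = \frac{\displaystyle\sum_{i=1}^{m} r_i - \frac{m(m+1)}{2}}{m\,n},
\]
where \(m\) and \(n\) are the numbers of positive and negative instances and \(r_i\) is the rank of the \(i\)-th positive instance when all instances are sorted by predicted score in ascending order.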

44
Performance Measures in Information Retrieval (IR)
  • An IR system, such as Google, given a query (keyword search), will try to retrieve all relevant documents in a corpus
  • Documents returned that are NOT relevant: FP
  • Relevant documents that are NOT returned: FN

45
Balancing the Tradeoff between Recall and Precision: F-measure
  • Two extreme cases (precision, recall, and F are written out below):
  • Return only one document, with 100% confidence; then precision = 1 but recall will be very small
  • Return all documents in the corpus; then recall = 1 but precision will be very small
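Written out, the quantities being traded off are:
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.
\]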

46
Curse of Dimensionality: Dealing with High-Dimensional Datasets
  • Learning in very high dimensions with very few samples, for example microarray datasets:
  • Colon cancer dataset: 2,000 genes vs. 62 samples
  • Acute leukemia dataset: 7,129 genes vs. 72 samples
  • In text mining there are many useless words, called stopwords, such as "is", "I", "and"
  • Feature selection will be needed

47
Feature Selection (Filter Model) Using a Fisher-like Score Approach
(Figure: class-conditional distributions of Feature 1, Feature 2, and Feature 3)
48
Weight Score Approach
The weight score is shown as a formula on the slide (a Fisher-like form is sketched below); it is built from the mean and standard deviation of each feature over the training examples of the positive and negative classes.
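A Fisher-like form consistent with that description (an assumption about the exact expression on the slide):
\[
w_j = \frac{\bigl(\mu_j^{+} - \mu_j^{-}\bigr)^2}{\bigl(\sigma_j^{+}\bigr)^2 + \bigl(\sigma_j^{-}\bigr)^2},
\]
where \(\mu_j^{\pm}\) and \(\sigma_j^{\pm}\) are the mean and standard deviation of feature \(j\) over the positive and negative training examples; a larger \(w_j\) indicates a more discriminative feature.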
49
Two-class classification vs. Binary Attribute
  • Test whether the class label and a single attribute are significantly correlated with each other

50
Mutual Information (MI)
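The slide's formula is an image; for a term or attribute \(X\) and class \(Y\), mutual information is:
\[
\mathrm{MI}(X;Y) = \sum_{x}\sum_{y} P(x,y)\,\log\frac{P(x,y)}{P(x)\,P(y)},
\]
which is zero when \(X\) and \(Y\) are independent and grows as they become more strongly associated.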
51
Ranking Features for Classification: Filter Model
  • Perfect feature selection would consider all possible subsets of features:
  • For each subset, train and test a classifier
  • Retain the subset that results in the highest accuracy (computationally infeasible)
  • Instead, measure the discrimination ability of each feature:
  • Rank the features w.r.t. this measure and select the top p features
  • Highly linearly correlated features might all be selected

52
Reuters-21578: 21,578 Docs, 27,000 Terms, and 135 Classes
  • 21,578 documents
  • Documents 1-14,818 belong to the training set
  • Documents 14,819-21,578 belong to the testing set
  • Reuters-21578 includes 135 categories, using the ApteMod version of the TOPICS set
  • This results in 90 categories with 7,770 training documents and 3,019 testing documents

53
Preprocessing Procedures (cont.)
  • After stopword elimination
  • After the Porter algorithm

54
Binary Text Classification: earn (+) vs. acq (-)
  • Select the top 500 terms using mutual information
  • Evaluate each classifier using the F-measure
  • Compare the two classifiers using a 10-fold paired t-test (a sketch of the pipeline follows below)
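A minimal sketch of this pipeline with scikit-learn and SciPy; the random term-count matrix stands in for the earn/acq documents and LinearSVC stands in for the presentation's RSVM, so every name here is an assumption rather than the original experiment:

  import numpy as np
  from scipy.stats import ttest_rel
  from sklearn.feature_selection import SelectKBest, mutual_info_classif
  from sklearn.metrics import f1_score
  from sklearn.model_selection import StratifiedKFold
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.svm import LinearSVC

  rng = np.random.default_rng(0)
  X = rng.poisson(0.3, size=(300, 2000)).astype(float)  # fake term counts
  y = rng.integers(0, 2, size=300)                       # fake earn(+)/acq(-) labels

  # Keep the top 500 terms ranked by mutual information with the class
  X_sel = SelectKBest(mutual_info_classif, k=500).fit_transform(X, y)

  f1_svm, f1_nb = [], []
  for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X_sel, y):
      f1_svm.append(f1_score(y[te], LinearSVC().fit(X_sel[tr], y[tr]).predict(X_sel[te])))
      f1_nb.append(f1_score(y[te], MultinomialNB().fit(X_sel[tr], y[tr]).predict(X_sel[te])))

  # Paired t-test on the per-fold F-measures of the two classifiers
  print(ttest_rel(f1_svm, f1_nb))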

55
10-fold Testing Results: RSVM vs. Naïve Bayes
There is no significant difference between RSVM and NB.