Title: Mathematical Programming in Support Vector Machines
1We Are Overwhelmed with Data
- Collecting and storing data are no longer expensive or difficult tasks
- Earth-orbiting satellites transmit terabytes of data every day
- The New York Stock Exchange processes, on average, 333 million transactions per day
- The gap between the generation of data and our understanding of it is growing explosively
2It Is Not Enough to Store Raw Data
- Raw data record only the facts and nothing else
- We want to learn something from the raw data
- We want the hidden information or knowledge in the raw data
- The idea of an easily accessible, individually configurable storehouse of knowledge: the beginning of the literature on mechanized information retrieval
3What is Data Mining?
(Get the Patterns that You Can't See)
- Computers make it possible.
4Data Mining Applications
(Why Does Amazon Know What You Need?)
- Consumer behavior analysis
- Customer relationship management (CRM)
- Disease diagnosis and prognosis
- Microarray gene expression data analysis
- Sports: NBA coaches' latest weapon
5Can Computers Read? Text Classification and Web Mining
- "Publication has been extended far beyond our present ability to make real use of the record"
- V. Bush, "As We May Think," Atlantic Monthly, 176 (1945), pp. 101-108
9The Beginning of Information Retrieval: Memex
The idea of an easily accessible, individually configurable storehouse of knowledge: the beginning of the literature on mechanized information retrieval
Vannevar Bush (1945)
10Digital Information Repositories
- World Wide Web (>10 billion documents)
- Digital Libraries (e.g. CDL)
- Special-purpose content providers (e.g. LexisNexis)
- Company intranets and digital assets
- Scientific literature libraries (e.g. CiteSeer)
- Medical information portals (e.g. MedlinePlus)
- Patent databases (e.g. US Patent Office)
- ...
12Information Retrieval
- Core problem of information retrieval (Maron & Kuhns, 1960): adequately identifying the information content of documentary data
- Facets
- Query-based search
- Categorizing and annotating documents
- Organizing and managing information repositories
- Assessing the quality and authoritativeness of information sources
- Understanding a user's information need
13Machine Learning and Information Retrieval
- Information content is not directly observable
- Relevant entities such as concepts, topics, categories, named entities, user interests, etc. need to be inferred from indirect evidence
- Observables (raw data): text, annotations, document structure, link structure, user actions, ratings, ...
- Inference problem: automatically infer the information content from the raw data
14IR vs. RDBMS
- RDBMS (relational database management system)
- Semantics of each object are well-defined
- Complex query languages
- You get exactly what you ask for
- Emphasis is on efficiency and ACID properties (i.e., atomicity, consistency, isolation, durability)
- IR
- Semantics of each object are ill-defined (people don't agree)
- Usually simple query languages
- You get what you want, even if you asked for it badly
15IR vs. RDBMS
- Merging trends
- RDBMS → IR
- Answer fuzzy questions
- e.g., find important publications on quantum mechanics around 1945?
- IR → RDBMS
- Information extraction
- unstructured data → structured data
- Semi-structured data, e.g., XML files
16IR Trend
From Jamie Callan's lecture slides
17Preprocessing for Text Classification: Converting Documents into Data Input
- Stopwords: words such as "a", "an", "the", "be", "on"
- Eliminating stopwords reduces space and improves performance
- Polysemy: e.g., "can" as a verb vs. as a noun
- Stemming (conflation) using Porter's algorithm
- Stemming increases the number of documents in the response, but also the number of irrelevant documents
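A minimal sketch of this preprocessing step, assuming NLTK's English stopword list and Porter stemmer (the slide names the techniques, not a library; any equivalent toolkit would do):

```python
# Stopword elimination + Porter stemming for one document.
# Assumes the NLTK data packages 'stopwords' and 'punkt' have been downloaded.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(text):
    stop = set(stopwords.words('english'))      # e.g. "a", "an", "the", "be", "on"
    stemmer = PorterStemmer()
    tokens = nltk.word_tokenize(text.lower())
    # Drop stopwords, then conflate word forms with Porter's algorithm.
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

print(preprocess("The cats are running on the mats"))   # -> ['cat', 'run', 'mat']
```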
18A School of Fish
19Fundamental Problems in Data Mining
- Classification problems (Supervised learning)
- Test classifier on fresh data to evaluate success
- Classification results are objective
- Decision trees, neural networks, support vector machines, k-nearest neighbor, Naive Bayes, etc.
- Linear and nonlinear regression
20Fundamental Problems in Data Mining
- Feature selection / dimension reduction
- Occam's razor: the simplest is the best
- Too many features can degrade generalization performance (the curse of dimensionality)
- Clustering (Unsupervised learning)
- Clustering results are subjective
- k-means and k-medians algorithms (a k-means sketch follows below)
- Association rules: minimum support and confidence thresholds are required
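A minimal k-means sketch, assuming scikit-learn as the implementation (the slide names the algorithm only; the toy data below are made up):

```python
# Cluster a toy 2-D dataset into k = 2 groups with k-means.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.5]])   # toy unlabeled data
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment of each point, e.g. [0 0 1 1]
print(km.cluster_centers_)   # the two cluster centroids
```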
21Binary Classification Problem: Learn a Classifier from the Training Set
Given a training dataset
Main goal: Predict the unseen class label for new data
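The training dataset itself appears only as an image in the original slide; in standard notation (my reconstruction), it is a set of labeled examples

S = { (x_1, y_1), ..., (x_m, y_m) },  with x_i ∈ R^n and y_i ∈ {-1, +1},

and the goal is to learn a classifier f : R^n → {-1, +1} whose prediction f(x) matches the label of unseen points x.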
22 Binary Classification Problem: Linearly Separable Case
(Figure: benign vs. malignant data points)
23Support Vector Machines for Classification: Maximizing the Margin between Bounding Planes
24Summary of the Notations
25Robust Linear Programming
Preliminary Approach to SVM
27Support Vector Machine Formulations
(Two Different Measures of Training Error)
2-Norm Soft Margin
1-Norm Soft Margin
- The margin is maximized by minimizing the reciprocal of the margin (proportional to the norm of the weight vector w)
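The formulations themselves appear only as images in the original slides; in standard notation (my reconstruction, whose symbols may differ from the slides'):

1-Norm soft margin:  min over w, b, ξ of  (1/2)||w||^2 + C Σ_i ξ_i   s.t.  y_i (w·x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0

2-Norm soft margin:  min over w, b, ξ of  (1/2)||w||^2 + (C/2) Σ_i ξ_i^2   s.t.  y_i (w·x_i + b) ≥ 1 - ξ_i

The margin between the bounding planes w·x + b = ±1 is 2/||w||, so minimizing ||w|| (the reciprocal of the margin, up to a constant) maximizes the margin, while the slack variables ξ_i measure the training error.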
28Tuning Procedure: How to Determine C?
The final value of the parameter C is the one with the maximum testing-set correctness!
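A minimal sketch of this tuning loop, assuming a simple train/validation split and a linear soft-margin SVM (the dataset, split, and library are stand-ins; the slide gives no code):

```python
# Grid search over C: keep the value with the highest held-out correctness.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # stand-in data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_C, best_acc = None, -1.0
for C in [2.0 ** k for k in range(-5, 6)]:                 # try C = 2^-5, ..., 2^5
    acc = SVC(kernel='linear', C=C).fit(X_tr, y_tr).score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc                          # keep the best C so far
print(best_C, best_acc)
```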
29Two-spiral Dataset (94 White Dots, 94 Red Dots)
30Support Vector Regression
(Figures: the original function and the estimated function)
MAE over the 49×49 mesh points: 0.0513
Training time: 22.58 sec.
31Naïve Bayes for Classification: Good for Binary as well as Multi-category Problems
- Let each attribute be a random variable. What is the probability of the class given an instance?
- Every attribute is equally important
- All attributes are assumed independent (given the class)!
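Under these two assumptions the classifier takes the standard form (the slide's own formula is an image):

P(C | a_1, ..., a_n) ∝ P(C) × P(a_1 | C) × ... × P(a_n | C)

where a_1, ..., a_n are the attribute values of an instance; we predict the class C that maximizes this product, with each probability estimated from frequencies in the training data.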
32The Weather Data Example
Ian H. Witten & Eibe Frank, Data Mining
33Probabilities for the Weather Data
Using Frequencies to Approximate Probabilities
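The table itself is an image; as an illustration of the idea, in the standard 14-instance weather data of Witten & Frank, 9 instances have play = yes and 2 of those have outlook = sunny, so the frequency estimates are P(play = yes) ≈ 9/14 and P(outlook = sunny | play = yes) ≈ 2/9.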
34The Zero-frequency Problem
- What if an attribute value does NOT occur with some class value?
- The posterior probability will be zero, no matter how likely the other attribute values are!
Q: A die is rolled 8 times, giving 2, 5, 6, 2, 1, 5, 3, 6. What is the probability of rolling a 4?
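A common fix is the Laplace correction: add 1 to every count. For the die question above, the face 4 never occurs in the 8 rolls, so the raw frequency estimate would be 0; with the Laplace correction over the 6 faces, P(4) = (0 + 1) / (8 + 6) = 1/14 ≈ 0.07 instead of 0.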
35How to Evaluate What's Been Learned? When Cost is NOT Sensitive
- Measure the performance of a classifier in terms of error rate or accuracy
Main goal: Predict the unseen class label for new data
- We have to assess a classifier's error rate on a set that plays no role in the learning process
- Split the data instances at hand into two parts
- Training set: for learning the classifier
- Testing set: for evaluating the classifier
36k-fold (Stratified) Cross-Validation: Maximize the Use of the Data at Hand
- Split the data into k approximately equal partitions
- Each partition in turn is used for testing while the remainder is used for training
- The labels (+/-) in the training and testing sets should be in about the right proportion
- Doing the random splitting within the positive class and the negative class separately will guarantee this
- This procedure is called stratification
- Leave-one-out cross-validation: k = number of data points
- No random sampling is involved, but it is nonstratified
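A minimal sketch of stratified 10-fold cross-validation, assuming scikit-learn and a linear SVM as the classifier (both are my choices; the slide describes only the procedure):

```python
# Stratified k-fold CV: each fold keeps the +/- class proportions of the full data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # stand-in data
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])   # learn on k-1 folds
    scores.append(clf.score(X[test_idx], y[test_idx]))           # test on the held-out fold
print(np.mean(scores))                                           # average accuracy
```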
37How to Compare Two Classifiers? Hypothesis Testing: the Paired t-test
- We compare two learning algorithms by comparing the average error rate over several cross-validations
- Assume the same cross-validation splits can be used for both methods
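A minimal sketch of the paired t-test on per-fold error rates, assuming SciPy and that both classifiers were run on the same cross-validation folds (the error rates below are made-up placeholders):

```python
# Paired t-test: the pairing is by fold, so both lists must use the same folds.
from scipy import stats

errors_A = [0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12, 0.10, 0.13]  # placeholder values
errors_B = [0.14, 0.13, 0.16, 0.12, 0.15, 0.11, 0.15, 0.13, 0.12, 0.14]  # placeholder values

t_stat, p_value = stats.ttest_rel(errors_A, errors_B)
print(t_stat, p_value)   # a small p-value means the difference is statistically significant
```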
38How to Evaluate What's Been Learned? When Cost IS Sensitive
- Two types of error will occur
- False Positives (FP): negative instances predicted as positive
- False Negatives (FN): positive instances predicted as negative
39ROC Curve: Receiver Operating Characteristic Curve
- An evaluation method for learning models.
- What it is concerned with is the ranking of instances produced by the learning model.
- A ranking means that we sort the instances w.r.t. the probability of being a positive instance, from high to low.
- The ROC curve plots the true positive rate (TPr) as a function of the false positive rate (FPr).
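In terms of the confusion-matrix counts (standard definitions, not spelled out in the transcript):

TPr = TP / (TP + FN)   (fraction of positive instances correctly detected)
FPr = FP / (FP + TN)   (fraction of negative instances incorrectly flagged as positive)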
40An Example of ROC Curve
41Using ROC to Compare Two Methods
Under the same FP rate, method A is better than method B
42What if There is a Tie?
Which one is better?
43Area under the Curve (AUC)
- An index of the ROC curve with range from 0 to 1.
- An AUC value of 1 corresponds to a perfect ranking (all positive instances are ranked higher than all negative instances).
- A simple formula for calculating AUC:
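The formula itself is an image in the original slide; a common rank-based form (which may or may not be the one shown there) is

AUC = ( Σ_{i ∈ positives} r_i - n_+ (n_+ + 1)/2 ) / (n_+ · n_-),

where the instances are ranked in increasing order of their predicted probability of being positive (rank 1 = lowest), r_i is the rank of the i-th positive instance, and n_+, n_- are the numbers of positive and negative instances. Equivalently, AUC is the fraction of (positive, negative) pairs in which the positive instance is ranked above the negative one.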
44Performance Measures in Information Retrieval (IR)
- An IR system, such as Google, given a query (keyword search), will try to retrieve all relevant documents in a corpus
- Documents returned that are NOT relevant: FP
- Relevant documents that are NOT returned: FN
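The standard precision and recall measures follow directly from these counts:

Precision = TP / (TP + FP)   (fraction of returned documents that are relevant)
Recall = TP / (TP + FN)   (fraction of relevant documents that are returned)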
45Balancing the Tradeoff between Recall and Precision: the F-measure
- Return only one document with 100% confidence: then precision = 1 but recall will be very small
- Return all documents in the corpus: then recall = 1 but precision will be very small
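The F-measure combines the two as their harmonic mean (the usual F_1 form; the slide may use a weighted variant):

F = 2 · Precision · Recall / (Precision + Recall)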
46Curse of Dimensionality: Dealing with High-Dimensional Datasets
- Learning in very high dimensions with very few samples; for example, microarray datasets
- Colon cancer dataset: 2,000 genes vs. 62 samples
- Acute leukemia dataset: 7,129 genes vs. 72 samples
- In text mining, there are many useless words, called stopwords, such as "is", "I", "and"
- Feature selection will be needed
47Feature Selection (Filter Model) Using a Fisher-like Score Approach
(Figures: Feature 1, Feature 2, and Feature 3)
48Weight Score Approach
Weight score w_j for feature j, where μ_j^± and σ_j^± denote the mean and standard deviation of feature j over the training examples of the positive and negative classes, respectively.
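The weight-score formula itself is an image in the original slide; one common Fisher-like score consistent with the description above (an assumption on my part, not necessarily the slide's exact formula) is

w_j = | μ_j^+ - μ_j^- | / ( σ_j^+ + σ_j^- ),

so that a feature whose class-conditional means are far apart relative to its spread receives a large weight.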
49Two-class classification vs. Binary Attribute
- Test whether the class label and a single attribute are significantly correlated with each other
50Mutual Information (MI)
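The MI formula is shown only as an image; the standard definition for two discrete random variables X (an attribute) and Y (the class label) is

MI(X; Y) = Σ_x Σ_y P(x, y) · log [ P(x, y) / ( P(x) P(y) ) ],

with the probabilities estimated from frequencies in the training data; MI(X; Y) = 0 when the attribute and the class are independent.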
51Ranking Features for Classification (Filter Model)
- The perfect feature selection would consider all possible subsets of features
- For each subset, train and test a classifier
- Retain the subset that results in the highest accuracy (computationally infeasible)
- Measure the discrimination ability of each feature
- Rank the features w.r.t. this measure and select the top p features
- Highly linearly correlated features might be selected
52Reuters-21578: 21,578 docs, 27,000 terms, and 135 classes
- 21,578 documents
- Documents 1-14,818 belong to the training set
- Documents 14,819-21,578 belong to the testing set
- Reuters-21578 includes 135 categories using the ApteMod version of the TOPICS set
- This results in 90 categories, with 7,770 training documents and 3,019 testing documents
53Preprocessing Procedures (cont.)
- After Stopword Elimination
- After the Porter Algorithm
54Binary Text Classification: earn (+) vs. acq (-)
- Select the top 500 terms using mutual information
- Evaluate each classifier using the F-measure
- Compare the two classifiers using a 10-fold paired t-test
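A minimal sketch of this pipeline, assuming scikit-learn; a synthetic matrix stands in for the Reuters earn/acq term matrix so the sketch runs, and LinearSVC stands in for the RSVM classifier used on the slides:

```python
# Top-500 terms by mutual information -> linear classifier -> F-measure per fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=2000, n_informative=50,
                           random_state=0)            # stand-in for the term matrix
clf = make_pipeline(
    SelectKBest(mutual_info_classif, k=500),          # keep the top 500 terms by MI
    LinearSVC(),                                      # stand-in for RSVM
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
f1 = cross_val_score(clf, X, y, cv=cv, scoring='f1')  # F-measure on each of the 10 folds
print(f1.mean())
# The per-fold F-measures of two classifiers can then be compared with a paired t-test.
```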
5510-fold Testing Results: RSVM vs. Naïve Bayes
There is no significant difference between RSVM and NB