Title: Mathematical Programming in Support Vector Machines
1We Are Overwhelmed with Data
- Collecting and storing data are no longer expensive or difficult tasks
- Earth-orbiting satellites transmit terabytes of data every day
- The New York Stock Exchange processes, on average, 333 million transactions per day
- The gap between the generation of data and our understanding of it is growing explosively
2It Is Not Enough to Store Raw Data
- Raw data record only the facts and nothing else
- We want to learn something from the raw data
- We want the hidden information or knowledge in the raw data
- The idea of an easily accessible, individually configurable storehouse of knowledge: the beginning of the literature on mechanized information retrieval
3What is Data Mining?
(Get the Patterns that You Can't See)
- Computers make it possible.
4Data Mining Applications
(Why Does Amazon Know What You Need?)
- Consumer behavior analysis
- Customer relationship management (CRM)
- Disease diagnosis and prognosis
- Microarray gene expression data analysis
- Sports: NBA coaches' latest weapon
5Can Computers Read? Text Classification and Web Mining
- "Publication has been extended far beyond our present ability to make real use of the record"
- V. Bush, "As We May Think," Atlantic Monthly, 176 (1945), pp. 101-108
9The Beginning of Information Retrieval: Memex
The idea of an easily accessible, individually configurable storehouse of knowledge: the beginning of the literature on mechanized information retrieval
Vannevar Bush (1945)
10Digital Information Repositories
- World Wide Web (>10 billion documents)
- Digital Libraries (e.g. CDL)
- Special-purpose content providers (e.g. LexisNexis)
- Company intranets and digital assets
- Scientific literature libraries (e.g. CiteSeer)
- Medical information portals (e.g. MedlinePlus)
- Patent databases (e.g. US Patent Office)
- ...
12Information Retrieval
- Core problem of information retrieval (Maron & Kuhns, 1960): adequately identifying the information content of documentary data
- Facets
- Query-based search
- Categorizing and annotating documents
- Organizing and managing information repositories
- Assessing the quality and authoritativeness of information sources
- Understanding a user's information need
13Machine Learning and Information Retrieval
- Information content is not directly observable
- Relevant entities such as concepts, topics, categories, named entities, user interests, etc. need to be inferred from indirect evidence
- Observables (raw data): text, annotations, document structure, link structure, user actions, ratings, ...
- Inference problem: automatically infer the information content from the raw data
14IR vs. RDBMS
- RDBMS (relational database management system)
- Semantics of each object are well-defined
- Complex query languages
- You get exactly what you ask for
- Emphasis is on efficiency and ACID properties (i.e., atomicity, consistency, isolation, durability)
- IR
- Semantics of each object are ill-defined (people don't agree)
- Usually simple query languages
- You get what you want, even if you asked for it badly
15IR vs. RDBMS
- Merging trends
- RDBMS → IR
- Answer fuzzy questions
- e.g., find important publications on quantum mechanics around 1945?
- IR → RDBMS
- Information extraction
- unstructured data → structured data
- Semi-structured data, e.g., XML files
16IR Trend
From Jamie Callan's lecture slides
17Preprocessing for Text Classification: Converting Documents into Data Input
- Stopwords: words such as "a", "an", "the", "be", "on"
- Eliminating stopwords reduces space and improves performance
- Polysemy: e.g., "can" as a verb vs. as a noun
- Stemming (conflation) using Porter's algorithm
- Stemming increases the number of documents in the response, but also the number of irrelevant documents
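A minimal sketch of this preprocessing step, assuming NLTK's English stopword list and Porter stemmer (the slide names the techniques, not a library; any equivalent toolkit would do):

```python
# Stopword elimination + Porter stemming for one document.
# Assumes the NLTK data packages 'stopwords' and 'punkt' have been downloaded.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(text):
    stop = set(stopwords.words('english'))      # e.g. "a", "an", "the", "be", "on"
    stemmer = PorterStemmer()
    tokens = nltk.word_tokenize(text.lower())
    # Drop stopwords, then conflate word forms with Porter's algorithm.
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

print(preprocess("The cats are running on the mats"))   # -> ['cat', 'run', 'mat']
```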
18A School of Fish
19Fundamental Problems in Data Mining
- Classification problems (Supervised learning)
- Test classifier on fresh data to evaluate success
- Classification results are objective
- Decision trees, neural networks, support vector machines, k-nearest neighbor, Naive Bayes, etc.
- Linear and nonlinear regression
20Fundamental Problems in Data Mining
- Feature selection / dimension reduction
- Occam's razor: the simplest is the best
- Too many features can degrade generalization performance (the curse of dimensionality)
- Clustering (Unsupervised learning)
- Clustering results are subjective
- k-means and k-medians algorithms (a k-means sketch follows below)
- Association rules: minimum support and confidence thresholds are required
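A minimal k-means sketch, assuming scikit-learn as the implementation (the slide names the algorithm only; the toy data below are made up):

```python
# Cluster a toy 2-D dataset into k = 2 groups with k-means.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.5]])   # toy unlabeled data
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment of each point, e.g. [0 0 1 1]
print(km.cluster_centers_)   # the two cluster centroids
```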
21Binary Classification Problem: Learn a Classifier from the Training Set
Given a training dataset
Main goal: Predict the unseen class label for new data
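The training dataset itself appears only as an image in the original slide; in standard notation (my reconstruction), it is a set of labeled examples

S = { (x_1, y_1), ..., (x_m, y_m) },  with x_i ∈ R^n and y_i ∈ {-1, +1},

and the goal is to learn a classifier f : R^n → {-1, +1} whose prediction f(x) matches the label of unseen points x.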
22 Binary Classification Problem: Linearly Separable Case
(Figure: benign vs. malignant data points)
23Support Vector Machines for Classification: Maximizing the Margin between Bounding Planes
24Summary of the Notations
25Robust Linear Programming
Preliminary Approach to SVM
27Support Vector Machine Formulations
(Two Different Measures of Training Error)
2-Norm Soft Margin
1-Norm Soft Margin
- The margin is maximized by minimizing the reciprocal of the margin (proportional to the norm of the weight vector w)
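The formulations themselves appear only as images in the original slides; in standard notation (my reconstruction, whose symbols may differ from the slides'):

1-Norm soft margin:  min over w, b, ξ of  (1/2)||w||^2 + C Σ_i ξ_i   s.t.  y_i (w·x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0

2-Norm soft margin:  min over w, b, ξ of  (1/2)||w||^2 + (C/2) Σ_i ξ_i^2   s.t.  y_i (w·x_i + b) ≥ 1 - ξ_i

The margin between the bounding planes w·x + b = ±1 is 2/||w||, so minimizing ||w|| (the reciprocal of the margin, up to a constant) maximizes the margin, while the slack variables ξ_i measure the training error.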
28Tuning Procedure: How to Determine C?
The final value of the parameter C is the one with the maximum testing-set correctness!
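A minimal sketch of this tuning loop, assuming a simple train/validation split and a linear soft-margin SVM (the dataset, split, and library are stand-ins; the slide gives no code):

```python
# Grid search over C: keep the value with the highest held-out correctness.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # stand-in data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_C, best_acc = None, -1.0
for C in [2.0 ** k for k in range(-5, 6)]:                 # try C = 2^-5, ..., 2^5
    acc = SVC(kernel='linear', C=C).fit(X_tr, y_tr).score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc                          # keep the best C so far
print(best_C, best_acc)
```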
29Two-spiral Dataset (94 White Dots, 94 Red Dots)
30Support Vector Regression
(Figures: the original function and the estimated function)
MAE over the 49×49 mesh points: 0.0513
Training time: 22.58 sec.
31Naïve Bayes for Classification: Good for Binary as well as Multi-category Problems
- Let each attribute be a random variable. What is the probability of the class given an instance?
- Every attribute is equally important
- All attributes are assumed independent (given the class)!
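Under these two assumptions the classifier takes the standard form (the slide's own formula is an image):

P(C | a_1, ..., a_n) ∝ P(C) × P(a_1 | C) × ... × P(a_n | C)

where a_1, ..., a_n are the attribute values of an instance; we predict the class C that maximizes this product, with each probability estimated from frequencies in the training data.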
32The Weather Data Example
Ian H. Witten & Eibe Frank, Data Mining
33Probabilities for the Weather Data
Using Frequencies to Approximate Probabilities
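The table itself is an image; as an illustration of the idea, in the standard 14-instance weather data of Witten & Frank, 9 instances have play = yes and 2 of those have outlook = sunny, so the frequency estimates are P(play = yes) ≈ 9/14 and P(outlook = sunny | play = yes) ≈ 2/9.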
34The Zero-frequency Problem
- What if an attribute value does NOT occur with some class value?
- The posterior probability will be zero, no matter how likely the other attribute values are!
Q: A die is rolled 8 times, giving 2, 5, 6, 2, 1, 5, 3, 6. What is the probability of rolling a 4?
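A common fix is the Laplace correction: add 1 to every count. For the die question above, the face 4 never occurs in the 8 rolls, so the raw frequency estimate would be 0; with the Laplace correction over the 6 faces, P(4) = (0 + 1) / (8 + 6) = 1/14 ≈ 0.07 instead of 0.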
35How to Evaluate What's Been Learned? When Cost is NOT Sensitive
- Measure the performance of a classifier in terms of error rate or accuracy
Main goal: Predict the unseen class label for new data
- We have to assess a classifier's error rate on a set that plays no role in the learning process
- Split the data instances at hand into two parts
- Training set: for learning the classifier
- Testing set: for evaluating the classifier
36k-fold (Stratified) Cross-Validation: Maximize the Use of the Data at Hand
- Split the data into k approximately equal partitions
- Each partition in turn is used for testing while the remainder is used for training
- The labels (+/-) in the training and testing sets should be in about the right proportion
- Doing the random splitting within the positive class and the negative class separately will guarantee this
- This procedure is called stratification
- Leave-one-out cross-validation: k = number of data points
- No random sampling is involved, but it is nonstratified
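A minimal sketch of stratified 10-fold cross-validation, assuming scikit-learn and a linear SVM as the classifier (both are my choices; the slide describes only the procedure):

```python
# Stratified k-fold CV: each fold keeps the +/- class proportions of the full data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # stand-in data
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])   # learn on k-1 folds
    scores.append(clf.score(X[test_idx], y[test_idx]))           # test on the held-out fold
print(np.mean(scores))                                           # average accuracy
```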
37How to Compare Two Classifiers? Hypothesis Testing: the Paired t-test
- We compare two learning algorithms by comparing the average error rate over several cross-validations
- Assume the same cross-validation splits can be used for both methods
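A minimal sketch of the paired t-test on per-fold error rates, assuming SciPy and that both classifiers were run on the same cross-validation folds (the error rates below are made-up placeholders):

```python
# Paired t-test: the pairing is by fold, so both lists must use the same folds.
from scipy import stats

errors_A = [0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12, 0.10, 0.13]  # placeholder values
errors_B = [0.14, 0.13, 0.16, 0.12, 0.15, 0.11, 0.15, 0.13, 0.12, 0.14]  # placeholder values

t_stat, p_value = stats.ttest_rel(errors_A, errors_B)
print(t_stat, p_value)   # a small p-value means the difference is statistically significant
```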
38How to Evaluate What's Been Learned? When Cost IS Sensitive
- Two types of error will occur
- False Positives (FP): negative instances predicted as positive
- False Negatives (FN): positive instances predicted as negative
39ROC Curve: Receiver Operating Characteristic Curve
- An evaluation method for learning models.
- What it is concerned with is the ranking of instances produced by the learning model.
- A ranking means that we sort the instances w.r.t. the probability of being a positive instance, from high to low.
- The ROC curve plots the true positive rate (TPr) as a function of the false positive rate (FPr).
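In terms of the confusion-matrix counts (standard definitions, not spelled out in the transcript):

TPr = TP / (TP + FN)   (fraction of positive instances correctly detected)
FPr = FP / (FP + TN)   (fraction of negative instances incorrectly flagged as positive)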
40An Example of ROC Curve
41Using ROC to Compare Two Methods
Under the same FP rate, method A is better than method B
42What if There is a Tie?
Which one is better?
43Area under the Curve (AUC)
- An index of the ROC curve with range from 0 to 1.
- An AUC value of 1 corresponds to a perfect ranking (all positive instances are ranked higher than all negative instances).
- A simple formula for calculating AUC:
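The formula itself is an image in the original slide; a common rank-based form (which may or may not be the one shown there) is

AUC = ( Σ_{i ∈ positives} r_i - n_+ (n_+ + 1)/2 ) / (n_+ · n_-),

where the instances are ranked in increasing order of their predicted probability of being positive (rank 1 = lowest), r_i is the rank of the i-th positive instance, and n_+, n_- are the numbers of positive and negative instances. Equivalently, AUC is the fraction of (positive, negative) pairs in which the positive instance is ranked above the negative one.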
44Performance Measures in Information Retrieval (IR)
- An IR system, such as Google, given a query (keyword search), will try to retrieve all relevant documents in a corpus
- Documents returned that are NOT relevant: FP
- Relevant documents that are NOT returned: FN
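The standard precision and recall measures follow directly from these counts:

Precision = TP / (TP + FP)   (fraction of returned documents that are relevant)
Recall = TP / (TP + FN)   (fraction of relevant documents that are returned)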
45Balancing the Tradeoff between Recall and Precision: the F-measure
- Return only one document with 100% confidence: then precision = 1 but recall will be very small
- Return all documents in the corpus: then recall = 1 but precision will be very small
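The F-measure combines the two as their harmonic mean (the usual F_1 form; the slide may use a weighted variant):

F = 2 · Precision · Recall / (Precision + Recall)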
46Curse of Dimensionality: Dealing with High-Dimensional Datasets
- Learning in very high dimensions with very few samples; for example, microarray datasets
- Colon cancer dataset: 2,000 genes vs. 62 samples
- Acute leukemia dataset: 7,129 genes vs. 72 samples
- In text mining, there are many useless words, called stopwords, such as "is", "I", "and"
- Feature selection will be needed
47Feature Selection (Filter Model) Using a Fisher-like Score Approach
(Figures: Feature 1, Feature 2, and Feature 3)
48Weight Score Approach
Weight score w_j for feature j, where μ_j^± and σ_j^± denote the mean and standard deviation of feature j over the training examples of the positive and negative classes, respectively.
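The weight-score formula itself is an image in the original slide; one common Fisher-like score consistent with the description above (an assumption on my part, not necessarily the slide's exact formula) is

w_j = | μ_j^+ - μ_j^- | / ( σ_j^+ + σ_j^- ),

so that a feature whose class-conditional means are far apart relative to its spread receives a large weight.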
49Two-class classification vs. Binary Attribute
- Test whether the class label and a single attribute are significantly correlated with each other
50Mutual Information (MI)
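The MI formula is shown only as an image; the standard definition for two discrete random variables X (an attribute) and Y (the class label) is

MI(X; Y) = Σ_x Σ_y P(x, y) · log [ P(x, y) / ( P(x) P(y) ) ],

with the probabilities estimated from frequencies in the training data; MI(X; Y) = 0 when the attribute and the class are independent.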
51Ranking Features for Classification (Filter Model)
- The perfect feature selection would consider all possible subsets of features
- For each subset, train and test a classifier
- Retain the subset that results in the highest accuracy (computationally infeasible)
- Measure the discrimination ability of each feature
- Rank the features w.r.t. this measure and select the top p features
- Highly linearly correlated features might be selected
52Reuters-21578: 21,578 docs, 27,000 terms, and 135 classes
- 21,578 documents
- Documents 1-14,818 belong to the training set
- Documents 14,819-21,578 belong to the testing set
- Reuters-21578 includes 135 categories using the ApteMod version of the TOPICS set
- This results in 90 categories, with 7,770 training documents and 3,019 testing documents
53Preprocessing Procedures (cont.)
- After Stopword Elimination
- After the Porter Algorithm
54Binary Text Classification: earn (+) vs. acq (-)
- Select the top 500 terms using mutual information
- Evaluate each classifier using the F-measure
- Compare the two classifiers using a 10-fold paired t-test
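A minimal sketch of this pipeline, assuming scikit-learn; a synthetic matrix stands in for the Reuters earn/acq term matrix so the sketch runs, and LinearSVC stands in for the RSVM classifier used on the slides:

```python
# Top-500 terms by mutual information -> linear classifier -> F-measure per fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=2000, n_informative=50,
                           random_state=0)            # stand-in for the term matrix
clf = make_pipeline(
    SelectKBest(mutual_info_classif, k=500),          # keep the top 500 terms by MI
    LinearSVC(),                                      # stand-in for RSVM
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
f1 = cross_val_score(clf, X, y, cv=cv, scoring='f1')  # F-measure on each of the 10 folds
print(f1.mean())
# The per-fold F-measures of two classifiers can then be compared with a paired t-test.
```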
5510-fold Testing Results: RSVM vs. Naïve Bayes
There is no significant difference between RSVM and NB