Transcript and Presenter's Notes

Title: FINAL PROJECT


1
FINAL PROJECT
  • COURSE NAME: Principles of Data Mining
  • COURSE CODE: COP-5992
  • PROFESSOR: Dr. Tao Li
  • APPROXIMATE DISTANCE CLASSIFICATION
  • BY RAMAKRISHNA VARADARAJAN

2
AN OVERVIEW
  • THE PROBLEM: Curse of Dimensionality.
  • THE SOLUTION.
  • THE METHODS.
  • ADC PROJECTIONS.
  • EXAMPLES.
  • EVALUATING THE PROJECTIONS.
  • IMPLEMENTATION OF THE METHOD.
  • RESULTS.

3
THE PROBLEM
  • Classification and clustering in high dimensions
    are NOTORIOUSLY DIFFICULT PROBLEMS.
  • Instead of searching for the clustering structure
    of the original, high-dimensional observations, it
    is common practice to employ dimension reduction
    methods.
  • The question "How to project?" naturally arises.

4
THE SOLUTION
  • In 1997, Cowen and Priebe [4] introduced a class
    of nonlinear projections that is easy to construct
    and has been demonstrated to preserve clustering
    structure in high-dimensional data sets that
    cluster strongly.
  • The motivation behind their work is to reduce
    dimensionality while approximately preserving
    inter-cluster distances.
  • The projections developed by Cowen and Priebe
    approximately preserve inter-class distances.

5
THE METHOD
  • Consequently, the classification and clustering
    techniques based on them are referred to as
    Approximate Distance Classification and Clustering
    methods, or ADC methods for short.
  • ADVANTAGES:
  • No pre-processing needed.
  • No data-dependent adjustments needed.
  • Performs surprisingly well in very high dimensions,
    where conventional methods fail for theoretical and
    computational reasons.
  • A very simple and effective technique.

6
ISSUES TO BE ADDRESSED
  • While using ADC methods we come across the
    following issues:
  • How many projections do we need to generate to
    get some that are useful?
  • How do we distinguish the "best" (or most useful)
    projections from the rest?

7
ADC PROJECTIONS
  • Given a set of observations in a high-dimensional
    space, we first seek a projection of the data into
    a lower-dimensional space for which approximate
    inter-cluster distances are maintained.
  • Definition: Let S = {X1, X2, ..., Xn} be a
    collection of n vectors (instances) in d dimensions
    (attributes). Let D be a subset of S and let ||.||
    denote the L2 norm. The associated ADC map is
    defined as the function
  • ADC_D: Xi -> min { ||Xi - Z|| : Z in D }.
  • L2 NORM: ||Xi - Z|| is the usual Euclidean distance
    between Xi and Z. (A minimal code sketch of this
    map is given below.)
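The ADC map can be sketched directly from this definition. The following Java fragment is a minimal illustration only (the class and method names are ours, not part of the original project); it reproduces the worked IRIS numbers used later in the deck.

    // Minimal sketch of the ADC map: project one instance to the minimum
    // L2 distance from any element of the witness set D.
    public class AdcMap {

        // Euclidean (L2) distance between two d-dimensional vectors.
        static double l2(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }

        // ADC_D(x) = min over Z in D of ||x - Z||
        static double adcProject(double[] x, double[][] witnessSet) {
            double min = Double.POSITIVE_INFINITY;
            for (double[] z : witnessSet) {
                min = Math.min(min, l2(x, z));
            }
            return min;
        }

        public static void main(String[] args) {
            // First IRIS instance and the two witness elements used later in the deck.
            double[] x1 = {5.1, 3.5, 1.4, 0.2};
            double[][] witnessSet = { {5.0, 3.6, 1.4, 0.2}, {5.4, 3.9, 1.7, 0.4} };
            System.out.println(adcProject(x1, witnessSet));   // prints ~0.141
        }
    }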

8
WITNESS SETS
  • The set D (the subset of the instance set) in the
    above definition will be referred to as the witness
    set that generates its associated projection.
  • Clearly, each ADC map is completely determined by
    the witness set used, and each determines a
    projection from d dimensions to 1 dimension.
  • In what follows, we will always choose the
    witness set entirely from one of the classes,
    without loss of generality.

9
THE ALGORITHM
  • For each candidate witness set D sampled from the
    data, drawn only from a single class (this loop
    iterates over all candidate witness sets):
  •   For each instance I of the data (this loop
      iterates over all data instances):
  •     For each element Z of the selected witness set
        (this loop iterates over the witness elements):
  •       Calculate the distance between instance I
          and witness element Z.
  •     Find the minimum of the calculated distances,
        d-min.
  •     Instance I is mapped to d-min in the
        lower-dimensional space.
  •   Evaluate the resulting projection using CROSS
      VALIDATION.
  • THEN SELECT THE BEST PROJECTION! (A Java sketch of
    this loop is given below.)
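A compact Java sketch of this loop, under the assumption that the evaluation step is supplied as a scoring function (for example, the cross-validated accuracy of a sub-classifier); the class and method names are illustrative, not the original course code.

    import java.util.*;
    import java.util.function.ToDoubleFunction;

    // Sketch of the ADC projection search: for every candidate witness set,
    // map each instance to its minimum L2 distance from the witness elements,
    // score the resulting 1-D projection, and keep the best one.
    public class AdcProjectionSearch {

        static double l2(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
            return Math.sqrt(s);
        }

        // Project the whole data set to 1-D using one witness set.
        static double[] project(double[][] data, double[][] witnessSet) {
            double[] projected = new double[data.length];
            for (int i = 0; i < data.length; i++) {
                double dMin = Double.POSITIVE_INFINITY;
                for (double[] z : witnessSet) {
                    dMin = Math.min(dMin, l2(data[i], z));   // distance to witness element Z
                }
                projected[i] = dMin;                          // instance i maps to d-min
            }
            return projected;
        }

        // Try every candidate witness set and keep the projection that scores
        // best under the supplied evaluation (e.g. cross-validated accuracy).
        static double[] bestProjection(double[][] data,
                                       List<double[][]> candidateWitnessSets,
                                       ToDoubleFunction<double[]> score) {
            double bestScore = Double.NEGATIVE_INFINITY;
            double[] best = null;
            for (double[][] witnessSet : candidateWitnessSets) {
                double[] projected = project(data, witnessSet);
                double s = score.applyAsDouble(projected);
                if (s > bestScore) { bestScore = s; best = projected; }
            }
            return best;
        }
    }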

10
EXAMPLE ADC PROJECTION
  • IRIS DATA
  • Number of Instances: 150 (50 in each of three
    classes)
  • Number of Attributes: 4 numeric, predictive
    attributes and a class attribute
  • Example instances of iris data (first six):
  • 5.1,3.5,1.4,0.2,Iris-setosa
  • 4.9,3.0,1.4,0.2,Iris-setosa
  • 4.7,3.2,1.3,0.2,Iris-setosa
  • 4.6,3.1,1.5,0.2,Iris-setosa
  • 5.0,3.6,1.4,0.2,Iris-setosa
  • 5.4,3.9,1.7,0.4,Iris-setosa
  • In our definition:
  • S = the set of all instances; D = the subset of
    instances called the witness set.
  • Let's select a witness set of size 2 at random
    (always from within one class). When computing the
    ADC projection we do not take the class attribute
    into consideration for the dimensionality
    reduction.
  • D = { (5.0,3.6,1.4,0.2), (5.4,3.9,1.7,0.4) }

11
EXAMPLE ADC PROJECTION (continued)
  • Now we apply the ADC projection to the first
    instance of the IRIS data: (5.1, 3.5, 1.4, 0.2).
  • Calculate the distance between the first instance
    and (5.0,3.6,1.4,0.2) using the L2 norm (distance
    formula). The result is 0.141.
  • Calculate the distance between the first instance
    and (5.4,3.9,1.7,0.4) using the L2 norm (distance
    formula). The result is 0.616.
  • Then find the minimum of the two results. In our
    case it is 0.141.
  • So the first instance is projected to 0.141.
  • Repeat this procedure for all the remaining
    instances to get the overall projection of the
    IRIS data by the selected witness set. (The two
    distance calculations are written out below.)
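For completeness, the two distance computations above written out in full (these restate the slide's own numbers):

    \[
    \|X_1 - Z_1\| = \sqrt{(5.1-5.0)^2 + (3.5-3.6)^2 + (1.4-1.4)^2 + (0.2-0.2)^2}
                  = \sqrt{0.02} \approx 0.141
    \]
    \[
    \|X_1 - Z_2\| = \sqrt{(5.1-5.4)^2 + (3.5-3.9)^2 + (1.4-1.7)^2 + (0.2-0.4)^2}
                  = \sqrt{0.38} \approx 0.616
    \]
    \[
    \mathrm{ADC}_D(X_1) = \min(0.141,\ 0.616) = 0.141
    \]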

12
IDENTIFYING GOOD WITNESS SETS
  • You can vary both the size of the witness set and
    the elements in it, but always within a single
    class, so as to preserve the inter-cluster
    distances and relationships.
  • Once the data is projected using a selected
    witness set, the resulting projected data is
    classified using a conventional method (for
    example, k-nearest neighbor or any other
    classification algorithm).
  • We call the conventional classifier used in
    combination with the ADC projection the ADC
    sub-classifier. (A sketch of one such
    sub-classifier is given below.)
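As one concrete example of an ADC sub-classifier, a nearest-neighbor rule applied to the one-dimensional projected values might look like the following. This is an assumed, simplified 1-NN sketch; the original project may have used a different classifier or implementation.

    // Sketch of an ADC sub-classifier: 1-nearest-neighbor on the
    // one-dimensional projected values (an assumed example choice).
    public class OneDNearestNeighbour {

        // Training data: projected value and class label for each instance.
        private final double[] projectedTrain;
        private final String[] labels;

        OneDNearestNeighbour(double[] projectedTrain, String[] labels) {
            this.projectedTrain = projectedTrain;
            this.labels = labels;
        }

        // Label a new projected value by the label of its closest training value.
        String classify(double projectedX) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int i = 0; i < projectedTrain.length; i++) {
                double dist = Math.abs(projectedX - projectedTrain[i]);
                if (dist < bestDist) { bestDist = dist; best = i; }
            }
            return labels[best];
        }
    }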

13
EVALUATING THE PROJECTIONS
  • There are many existing methods to measure the
    quality of a projection.
  • The projection generated by D can be evaluated
    with respect to a particular ADC sub-classifier
    by using CROSS-VALIDATION.
  • When presented with a new unlabeled observation X,
    the method classifies it by projecting X to
    one-dimensional space using ADC_D and then labeling
    ADC_D(X) with the selected ADC sub-classifier. (A
    cross-validation sketch follows.)
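One simple way to realize the cross-validation score for a candidate witness set is leave-one-out validation of a 1-NN rule on the projected values. The sketch below is an assumption about how such a score could be computed, not the project's actual evaluation code.

    // Sketch: leave-one-out cross-validation accuracy of a 1-NN rule
    // on the 1-D projected data. Used to score a candidate witness set.
    public class ProjectionScore {

        static double looAccuracy(double[] projected, String[] labels) {
            int correct = 0;
            for (int i = 0; i < projected.length; i++) {
                int best = -1;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int j = 0; j < projected.length; j++) {
                    if (j == i) continue;                       // leave instance i out
                    double dist = Math.abs(projected[i] - projected[j]);
                    if (dist < bestDist) { bestDist = dist; best = j; }
                }
                if (labels[best].equals(labels[i])) correct++;  // count correct predictions
            }
            return (double) correct / projected.length;
        }
    }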

14
PROCEDURE SO FAR..
  • First, we sample w witness sets from the set of
    all size-s subsets of the training data in a
    single class. For example, in the IRIS data there
    are 3 classes (50 instances in each). So if we
    select the size s to be 3, there are C(50,3) =
    19,600 possible witness sets (the count is worked
    out below), and we can sample any number w of them
    for projection.
  • Evaluate the witness sets using cross-validation.
  • Now we select the r best-scoring witness sets,
    where r is called the filtering parameter.
  • THE PARAMETERS w, s AND r CAN BE VARIED.
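The 19,600 figure is simply the binomial coefficient for choosing 3 witnesses from the 50 instances of one class:

    \[
    \binom{50}{3} = \frac{50 \cdot 49 \cdot 48}{3!} = \frac{117600}{6} = 19600
    \]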

15
IMPLEMENTATION
  • A program in Java that reads the IRIS data and
    performs ADC projections on it.
  • THE INPUT: the size and the elements of the chosen
    witness set.
  • THE OUTPUT: the reduced 1-dimensional ARFF file,
    which can later be classified using an ADC
    sub-classifier (e.g. k-nearest neighbor) to
    evaluate the chosen witness set.
  • The program reads the iris.txt file in ARFF format
    and performs the mapping using the witness-set
    parameters supplied by the user. (A simplified
    sketch of such a program is given below.)
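A simplified sketch of such a program follows. It is not the original course implementation: the ARFF handling is deliberately minimal (header lines are skipped up to @data), the witness elements are hard-coded rather than read from the user, and the output file name iris_adc.arff is an assumption (iris.txt comes from the slide).

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    // Simplified sketch: read iris data in ARFF format, apply the ADC
    // projection for a chosen witness set, and write a 1-D ARFF file.
    public class AdcIrisDemo {

        static double l2(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
            return Math.sqrt(s);
        }

        public static void main(String[] args) throws IOException {
            List<double[]> instances = new ArrayList<>();
            List<String> classes = new ArrayList<>();

            boolean inData = false;
            for (String line : Files.readAllLines(Paths.get("iris.txt"))) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("%")) continue;
                if (!inData) {                                   // skip ARFF header lines
                    if (line.equalsIgnoreCase("@data")) inData = true;
                    continue;
                }
                String[] parts = line.split(",");
                double[] x = new double[parts.length - 1];       // last field is the class
                for (int i = 0; i < x.length; i++) x[i] = Double.parseDouble(parts[i]);
                instances.add(x);
                classes.add(parts[parts.length - 1]);
            }

            // Example witness set: the 5th and 6th instances (both Iris-setosa).
            double[][] witnessSet = { instances.get(4), instances.get(5) };

            try (PrintWriter out = new PrintWriter("iris_adc.arff")) {
                out.println("@relation iris_adc");
                out.println("@attribute adc_projection numeric");
                out.println("@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}");
                out.println("@data");
                for (int i = 0; i < instances.size(); i++) {
                    double dMin = Double.POSITIVE_INFINITY;
                    for (double[] z : witnessSet) dMin = Math.min(dMin, l2(instances.get(i), z));
                    out.println(dMin + "," + classes.get(i));    // projected value + class label
                }
            }
        }
    }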

16 - 18
(No transcript)
19
THANKS
  • QUESTIONS?