Fingerprint Clustering Comparative Study - PowerPoint PPT Presentation

Provided by: TBi79

1
Fingerprint Clustering Comparative Study
  • 606 Project Presentation by Dean Cheng

2
Outline
  • A Brief Background Introduction
  • Some Definitions
  • Clustering Algorithms
  • Results
  • Conclusion and Future directions

3
Oligonucleotide Fingerprinting (1)
  • DNA clone: long active linear sequence of
    nucleotides (unknown)
  • Probe: short linear sequence of 6-8 nucleotides
    (known)

[Figure: example probe-to-clone hybridizations by base pairing (A-T, G-C), showing total, half, and no hybridization]
4
Oligonucleotide Fingerprinting (2)
  • Fluorescently labeled probes: higher intensity
    indicates better hybridization

5
Oligonucleotide Fingerprinting (3)
Fingerprint vector: a vector of the intensity values
of all probes for a clone. Example: c1 has a
fingerprint of 1 0 1 0 0.
6
Oligonucleotide Fingerprinting (4)
  • Applications
  • DNA sequencing
  • Gene expression
  • DNA clone classification
  • Clustering
  • Similarity/Distance functions
  • Algorithms
  • Focus on binary (ternary) domain

7
Project Specifications
  • Three algorithms and two fingerprint
    representations:
  • UPGMA-0-1, FREQ-0-1, UPGMA-0-1-N, GCP-0-1-N
  • 0-1-N: 2 thresholds
  • Intensity above positive control: 1
  • Intensity below negative control: 0
  • Anything in between: N (unknown)
  • Hamming distance. For 0-1-N fingerprints, ignore
    N.
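The two-threshold encoding and the N-ignoring Hamming distance described on this slide could be sketched roughly as follows. The function names and the string encoding of fingerprints are assumptions made for illustration, not taken from the project.

```python
def discretize(intensities, negative_control, positive_control):
    """Map raw probe intensities to a 0-1-N string using two thresholds."""
    symbols = []
    for x in intensities:
        if x > positive_control:
            symbols.append("1")   # clearly hybridized
        elif x < negative_control:
            symbols.append("0")   # clearly not hybridized
        else:
            symbols.append("N")   # in between: unknown
    return "".join(symbols)


def hamming_ignore_n(a, b):
    """Count positions where a and b differ, skipping any position with an N."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y and "N" not in (x, y))
```

For plain 0-1 fingerprints the same distance function reduces to the ordinary Hamming distance, since no position is skipped.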

8
Binary Clustering with Missing Values (BCMV)
  • Two 0-1-N vectors are compatible if they do not
    differ at any position or they only differ at
    positions where one or both of them has N.
  • A 0-1-N vector is resolved if the value of each N
    is determined (0 or 1).
  • The challenge is to cluster only mutually
    compatible 0-1-N fingerprint vectors such that
    fingerprint vectors within a cluster can be
    resolved in the same way.
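As a minimal sketch of these definitions, assuming fingerprints are stored as strings over the characters 0, 1 and N (the helper names are hypothetical):

```python
from itertools import combinations


def compatible(a, b):
    """Two 0-1-N vectors are compatible if they agree wherever neither has N."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))


def mutually_compatible(vectors):
    """True if every pair of vectors in the group is compatible."""
    return all(compatible(a, b) for a, b in combinations(vectors, 2))


def resolve(vectors):
    """Resolve a mutually compatible group to one consensus vector.

    A position stays N only if every vector in the group has N there.
    """
    resolved = []
    for column in zip(*vectors):
        known = set(column) - {"N"}
        if len(known) > 1:
            raise ValueError("vectors are not mutually compatible")
        resolved.append(known.pop() if known else "N")
    return "".join(resolved)
```

Compatibility is not transitive (10N and 11N are both compatible with 1NN but not with each other), which is why a cluster has to be checked pairwise rather than against a single representative.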

9
Graphs and 0-1
  • For 0-1 fingerprint vectors, we can try to find a
    resolve vector within Hamming distance 1 of each
    vector.
  • For example, the set 100, 010, 001 has a resolve
    vector of 000.
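One way to search for such a resolve vector is brute force: any vector within Hamming distance 1 of every member must in particular be within distance 1 of the first member, so only that vector and its single-bit flips need to be tried. This is an illustrative sketch with a hypothetical function name, not the method used in the project.

```python
def hamming(a, b):
    """Ordinary Hamming distance between two equal-length 0-1 strings."""
    return sum(x != y for x, y in zip(a, b))


def find_resolve_vector(vectors):
    """Return a 0-1 vector within Hamming distance 1 of every input, or None."""
    first = vectors[0]
    # Candidates: the first vector itself plus each single-bit flip of it.
    candidates = [first]
    for i, bit in enumerate(first):
        flipped = "1" if bit == "0" else "0"
        candidates.append(first[:i] + flipped + first[i + 1:])
    for candidate in candidates:
        if all(hamming(candidate, v) <= 1 for v in vectors):
            return candidate
    return None
```

On the slide's example, `find_resolve_vector(["100", "010", "001"])` returns `"000"`.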

10
GCP (Greedy Clique Partition)
  • The algorithm first finds unique maximal cliques
    and removes them from the graph.
  • Then a greedy step is used to find and remove
    maximum cliques from the graph until all vertices
    in the graph belong to some clique.
  • Implementation by Zhipeng Cai.
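The general idea of partitioning a compatibility graph into cliques can be illustrated with a much simpler greedy heuristic. The sketch below is not the implementation referred to on this slide: it skips the unique-maximal-clique step and grows each clique greedily, so it does not guarantee maximum cliques. All names are hypothetical.

```python
from itertools import combinations


def compatible(a, b):
    """0-1-N compatibility: agree wherever neither vector has N."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))


def greedy_clique_partition(vectors):
    """Partition vector indices into cliques of the compatibility graph."""
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if compatible(vectors[i], vectors[j]):
            adj[i].add(j)
            adj[j].add(i)

    remaining = set(range(n))
    cliques = []
    while remaining:
        # Seed with the remaining vertex that has the most remaining neighbours.
        seed = max(sorted(remaining), key=lambda v: len(adj[v] & remaining))
        clique = [seed]
        candidates = adj[seed] & remaining
        while candidates:
            # Add the candidate adjacent to the most other candidates.
            nxt = max(sorted(candidates), key=lambda v: len(adj[v] & candidates))
            clique.append(nxt)
            candidates = candidates & adj[nxt]
        cliques.append(sorted(clique))
        remaining -= set(clique)
    return cliques
```

Because every vertex added to a clique is adjacent to all vertices already in it, each returned cluster is mutually compatible by construction.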

11
UPGMA
  • Given a distance matrix, join the two nodes most
    similar to each other and update the distance
    matrix with the average distance to the two
    nodes. The output is a tree.
  • PHYLIP's neighbor program (neighbor joining) also
    implements UPGMA.
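A compact pure-Python sketch of the merge loop is given below. It uses the standard size-weighted average when updating distances, returns the tree as nested tuples, and omits branch lengths. The function name and the tree representation are assumptions for illustration.

```python
def upgma(dist, labels):
    """Build a UPGMA tree (nested tuples) from a symmetric distance matrix."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}   # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                       # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average distance from c to the merged cluster.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree
```

To obtain flat clusters from the tree, one would still need to cut it at some distance threshold, which this sketch leaves out.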

12
FREQ (1)
  • Frequent itemset mining: find itemsets that have
    support above a minimum support.

Support of {1, 2, 3} is 40. Support of {1, 2}
is 60.
Apriori: if an itemset is frequent, then all of its
subsets must be frequent.
E.g. {1}: 80, {2}: 60, {3}: 40, {1, 2}: 60, {2, 3}: 40
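Support counting and the Apriori pruning rule can be sketched as below. The transaction list in the usage note is hypothetical: it was chosen so that, if the slide's support values are read as percentages of five transactions (an assumption), the same numbers come out.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def apriori(transactions, min_support):
    """Return {frequent itemset: support} using level-wise Apriori search."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset, transactions)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k))
                 and support(c, transactions) >= min_support]
        k += 1
    return frequent
```

With the hypothetical transactions `[{1, 2, 3}, {1, 2, 3}, {1, 2}, {1}, {4}]` and a minimum support of 0.4, the itemset {1, 2, 3} has support 0.4 and {1, 2} has support 0.6.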
13
FREQ (2)
  • Transform fingerprints into transactions: use a
    fingerprint itself and its compatible
    fingerprints.
  • E.g.
  • 1. 111001NN  T1 = {1, 2, 3, 4, 5}
  • 2. 11100N1N  T2 = {1, 2, 3, 4}
  • 3. 11100N11  T3 = {1, 2, 3, 4}
  • 4. 1110N11N  T4 = {1, 2, 3, 4}
  • 5. 1110N10N  T5 = {1, 5}
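The transformation in this example can be reproduced with a few lines, assuming string-encoded fingerprints and 1-based fingerprint ids as on the slide (the function names are hypothetical):

```python
def compatible(a, b):
    """0-1-N compatibility: agree wherever neither vector has N."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))


def to_transactions(fingerprints):
    """One transaction per fingerprint: the ids of all fingerprints
    compatible with it, including itself (ids start at 1)."""
    return [
        {j + 1 for j, other in enumerate(fingerprints) if compatible(fp, other)}
        for fp in fingerprints
    ]
```

Applied to the five fingerprints above, this yields the transactions T1 to T5 listed on the slide; fingerprint 5 conflicts with fingerprints 2, 3 and 4 at the seventh position, so T5 contains only 1 and 5.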

14
FREQ (3)
  • The same can be done for 0-1 vectors using
    Hamming distance 2 (e.g. 010, 100, 001 have a
    resolve vector of 000).
  • Experiment: preprocess the fingerprints first and
    use only distinct fingerprints, dropping
    identical copies (Hamming distance 0). This is an
    implementation issue.
    E.g. 101, 101, 111, 111, 100, 010, 001.
  • Take the longest frequent itemset containing
    un-clustered fingerprints as a cluster.
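The last bullet leaves some room for interpretation. The sketch below assumes one reading: visit frequent itemsets from longest to shortest, accept an itemset as a cluster only if all of its fingerprints are still un-clustered, and turn leftovers into singletons. Both this reading and the function name are assumptions.

```python
def clusters_from_itemsets(frequent_itemsets, all_ids):
    """Greedily turn frequent itemsets into disjoint clusters, longest first."""
    unclustered = set(all_ids)
    clusters = []
    # Longest itemsets first; ties broken by sorted ids for determinism.
    ordered = sorted(frequent_itemsets, key=lambda s: (-len(s), sorted(s)))
    for itemset in ordered:
        if itemset <= unclustered:
            clusters.append(set(itemset))
            unclustered -= itemset
    # Fingerprints not covered by any accepted itemset become singletons.
    clusters.extend({i} for i in sorted(unclustered))
    return clusters
```

A looser reading, in which an itemset only needs to contain some un-clustered fingerprint, would allow a fingerprint to appear in more than one cluster.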

15
Validation methods
  • Number of clusters with size > 2. Number of
    singletons.
  • Average homogeneity and average separation.
    Euclidean distance and Hamming distance are used.
  • Minkowski measure and Jaccard's coefficient, using
    GCP's result as the standard.
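Both pair-counting measures compare which pairs of fingerprints are placed together in a reference clustering and in a test clustering. The sketch below uses the commonly cited definitions, with n11 pairs together in both, n10 together only in the reference, and n01 together only in the test solution; whether the project used exactly these forms is an assumption.

```python
from itertools import combinations
from math import sqrt


def co_clustered_pairs(clusters):
    """Set of unordered element pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def jaccard_and_minkowski(reference, solution):
    """Return (Jaccard coefficient, Minkowski measure) of solution vs reference.

    Raises ZeroDivisionError if the reference has no co-clustered pairs.
    """
    ref = co_clustered_pairs(reference)
    sol = co_clustered_pairs(solution)
    n11 = len(ref & sol)    # together in both
    n10 = len(ref - sol)    # together in the reference only
    n01 = len(sol - ref)    # together in the solution only
    jaccard = n11 / (n11 + n10 + n01)
    minkowski = sqrt((n10 + n01) / (n11 + n10))
    return jaccard, minkowski
```

A Jaccard coefficient of 1 and a Minkowski measure of 0 mean the two clusterings agree on every pair; higher Jaccard and lower Minkowski values indicate closer agreement with the reference.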

16
Datasets and thresholds
  • Bacteria dataset: 1491 clones and 27 probes.
    Fungi dataset: 1507 clones and 26 probes.
  • Thresholds (pair for 0-1-N, single value for 0-1)
  • Bacteria 1: 0.35-0.50 and 0.425
  • Bacteria 2: 0.35-0.55 and 0.45
  • Fungi: 425-775 and 600

17
Results
18
Discussion (1)
  • GCP-0-1-N performs the best and is guaranteed to
    find mutually compatible clusters.
  • UPGMA-0-1-N also returns good results but finds
    fewer larger-size clusters.
  • Both GCP-0-1-N and UPGMA-0-1-N outperform
    UPGMA-0-1 and FREQ-0-1. UPGMA-0-1-N is superior to
    UPGMA-0-1. 0-1-N is a better representation.
  • UPGMA finds more singletons and fewer larger-size
    clusters.

19
Discussion (2)
  • FREQ-0-1 is better than UPGMA-0-1, but its
    clustering quality is unstable. FREQ-0-1 is
    comparable to GCP-0-1-N and UPGMA-0-1-N on
    Bacteria 1.
  • Constraints must be added to FREQ. Incrementally
    decreasing the support can help efficiency and
    quality.
  • Minkowski measure and Jaccard's coefficient
    indicate that the clustering solutions differ.
    UPGMA-0-1-N > FREQ-0-1 > UPGMA-0-1.

20
Conclusion and Future Directions
  • Cluster size, homogeneity, separation, Minkowski
    measure and Jaccard's coefficient are used to
    evaluate cluster quality.
  • 0-1-N is a better representation. Use UPGMA-0-1-N
    instead of UPGMA-0-1.
  • GCP performs the best.
  • FREQ works in both 0-1 and 0-1-N but needs more
    testing. Adding constraints is key. Other
    frequent itemset mining algorithms, such as
    Max-Miner, can be explored. Is assigning a
    fingerprint to multiple clusters a benefit?

21
  • Thank You