4. Molecular Similarity - PowerPoint PPT Presentation

Provided by: Sho57
Learn more at: https://ics.uci.edu
1
4. Molecular Similarity
2
Similarity and Searching
  • Historical Progression
  • Similarity Measures
  • Fingerprint Construction
  • Pathological Cases
  • MinMax - Counts
  • Pruning Search Space
  • Aggregate Queries
  • LSH

3
Historical Progression
  • Maximum Common Subgraph-Isomorphism (MCS)
      • maximum common substructure between two molecules
      • NP-complete
  • Structural Keys
      • dictionary of predetermined, domain-specific
        substructures keyed to particular positions in a
        bit-vector constructed for each molecule
      • similarity computed between bit-vectors (fast
        O(D) scan)
  • 2D Compressed Fingerprints
      • ALL substructures stored in a bit-vector using a
        hashing scheme plus lossy compression (modulo
        operator)
      • similarity computed between bit-vectors or count
        vectors
  • Faster Searches
      • database pruning
      • locality-sensitive hashing (LSH), towards O(log
        n) similarity searching

4
Superstructure and Substructure Searches
  • A is a superstructure of B (ignoring H)
  • B is a substructure of A
  • Tversky similarity

5
The Similarity Problem
  • How similar?

6
Spectral Similarity
  • Count substructures
  • Compare the count/bit vectors

7
2D Graph Substructures
  • For chemical compounds
      • atom/node labels: A = {C, N, O, H, ...}
      • bond/edge labels: B = {s, d, t, ar, ...}
  • Trace ALL paths
      • O(N d^l)
  • Cycles and trees
  • Combinatorial space
  • Example path label: CsNsCdO
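
The path tracing on this slide can be sketched as a depth-first walk over a labeled graph. This is only an illustration, not the presenter's implementation: the function name `enumerate_paths`, the toy fragment, and the choice to record each path in both directions are assumptions made for the example. The cost notation on the slide presumably refers to N atoms, maximum degree d, and path length l.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def enumerate_paths(atoms: List[str],
                    bonds: Dict[Tuple[int, int], str],
                    max_bonds: int) -> Set[str]:
    """Return label strings for all simple paths with at most max_bonds bonds."""
    # Build an undirected adjacency list: atom index -> [(neighbor, bond label)].
    adjacency: Dict[int, List[Tuple[int, str]]] = {i: [] for i in range(len(atoms))}
    for (u, v), bond in bonds.items():
        adjacency[u].append((v, bond))
        adjacency[v].append((u, bond))

    found: Set[str] = set()

    def walk(node: int, visited: FrozenSet[int], label: str) -> None:
        found.add(label)
        if len(visited) - 1 == max_bonds:  # bonds used so far = atoms visited - 1
            return
        for neighbor, bond in adjacency[node]:
            if neighbor not in visited:  # simple paths only: never revisit an atom
                walk(neighbor, visited | {neighbor}, label + bond + atoms[neighbor])

    for start in range(len(atoms)):
        walk(start, frozenset([start]), atoms[start])
    return found


# Hypothetical fragment C-N-C=O (hydrogens ignored); s = single, d = double.
atoms = ["C", "N", "C", "O"]
bonds = {(0, 1): "s", (1, 2): "s", (2, 3): "d"}
paths = enumerate_paths(atoms, bonds, max_bonds=3)
print(sorted(paths))
```

The resulting set contains the label CsNsCdO from the slide along with its reverse, OdCsNsC. A production system would usually canonicalize the two directions into one feature.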
8
Mapping Structures to Bits
  • Compact data representation
  • Hash each path to a bit vector: feature space → bit
    space
  • Resolve clashes with the OR operator (i.e., 1 OR 1 = 1)
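
A minimal sketch of the mapping, assuming features are path label strings and using CRC32 as a stand-in hash (the slides do not name a hash function, so that choice and the function name `fingerprint` are illustrative only). The modulo operation is the lossy compression step; the bitwise OR means that two features landing on the same position simply leave that bit on.

```python
import zlib
from typing import Iterable


def fingerprint(features: Iterable[str], n_bits: int = 64) -> int:
    """Fold hashed features into an n_bits-wide bit vector stored as an int."""
    bits = 0
    for feature in features:
        # Hash the feature, then compress into the bit space with modulo.
        index = zlib.crc32(feature.encode("utf-8")) % n_bits
        # OR resolves clashes: setting an already-set bit changes nothing.
        bits |= 1 << index
    return bits


fp = fingerprint(["CsN", "NsC", "CdO", "CsNsCdO"], n_bits=16)
print(format(fp, "016b"))
```

Because clashes are absorbed by OR, the number of set bits can be smaller than the number of distinct features, which is where the information loss comes from.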

9
Similarity Measures
  • There are many ways of measuring similarity (or
    distance) between bit/count vectors
  • Euclidean
  • Cosine
  • Exponentials
  • Tanimoto/Jaccard
  • Tversky
  • MinMax
  • And many more (L1, L2, Lp, Hamming, Manhattan, ...)

10
(No Transcript)
11
(No Transcript)
12
Similarity Measures: Tanimoto
  • Tally features
      • Unique to one fingerprint (a, b)
      • Both on (c)
      • Both off (d)
  • Similarity formula
      • Tanimoto = c / (a + b + c)
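
The tally can be sketched with bitwise operations on fingerprints stored as Python integers. The integer representation and the decision to return 0.0 when both fingerprints are empty are assumptions for this example; the slide does not specify either.

```python
def tanimoto(x: int, y: int) -> float:
    """Tanimoto similarity c / (a + b + c) for two non-negative int bit vectors."""
    c = bin(x & y).count("1")    # bits on in both
    a = bin(x & ~y).count("1")   # bits on only in x
    b = bin(~x & y).count("1")   # bits on only in y
    total = a + b + c
    # Arbitrary convention for two empty fingerprints (an assumption).
    return c / total if total else 0.0


print(tanimoto(0b1101, 0b1011))  # c = 2, a = 1, b = 1, so 2 / 4 = 0.5
```

Bits that are off in both fingerprints (d) do not enter the formula.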

13
The Fingerprint Approximation
  • Fingerprint bit similarity approximates chemical
    feature similarity.

14
Similarity Measures: Tversky
  • Tally features
      • Unique to one fingerprint (a, b)
      • Both on (c)
      • Both off (d)
  • Similarity formula
      • Tanimoto = c / (a + b + c)
      • Tversky(α, β) = c / (αa + βb + c)
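
A hedged sketch of the Tversky measure with weights α and β, using Python integers as bit vectors (an assumption for illustration). With α = β = 1 it reduces to Tanimoto; setting one weight to 0 ignores the features unique to one side, which is how the measure can express substructure-like or superstructure-like matching.

```python
def tversky(x: int, y: int, alpha: float, beta: float) -> float:
    """Tversky similarity c / (alpha*a + beta*b + c) for int bit vectors."""
    c = bin(x & y).count("1")    # bits on in both
    a = bin(x & ~y).count("1")   # bits on only in x
    b = bin(~x & y).count("1")   # bits on only in y
    denominator = alpha * a + beta * b + c
    # Arbitrary convention when the denominator is zero (an assumption).
    return c / denominator if denominator else 0.0


molecule = 0b1111   # hypothetical larger structure
query = 0b0101      # hypothetical fragment whose bits are all set in molecule
print(tversky(molecule, query, 1.0, 1.0))  # Tanimoto: 2 / (2 + 0 + 2) = 0.5
print(tversky(molecule, query, 0.0, 1.0))  # extra bits in molecule ignored: 1.0
```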

15
Pathological Cases
  • On the Properties of Bit String-Based Measures of
    Chemical Similarity. Flower DR, J. Chem. Inf.
    Comput. Sci. 1998, 38, 379-386

16
Pathological Cases
  • Issue of labeling scheme.

17
Counts
  • MinMax similarity is a generalization of Tanimoto
    which uses the counts.
  • MinMax can work better than Tanimoto.
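
As a sketch of the generalization: MinMax divides the sum of per-feature minimum counts by the sum of per-feature maximum counts. Storing counts in a `Counter` keyed by path label is an assumption for the example. When every count is 0 or 1 the value equals Tanimoto.

```python
from collections import Counter
from typing import Mapping


def minmax(x: Mapping[str, int], y: Mapping[str, int]) -> float:
    """MinMax similarity: sum of min counts over sum of max counts."""
    keys = set(x) | set(y)
    numerator = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    denominator = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    # Arbitrary convention for two empty count vectors (an assumption).
    return numerator / denominator if denominator else 0.0


# Hypothetical path counts for two molecules.
mol_x = Counter({"CsN": 2, "CdO": 1})
mol_y = Counter({"CsN": 1, "CdO": 1, "CsC": 3})
print(minmax(mol_x, mol_y))  # (1 + 1 + 0) / (2 + 1 + 3) = 1/3
```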

18
Pruning Search Space Using Bounds
  • Linear speedup (search C × D) for a fixed threshold,
    often by one order of magnitude or more.
  • Sub-linear speedup (search C × D^0.6) for top K.
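
The transcript does not say which bounds are used. The sketch below assumes the bit-count bound Tanimoto(A, B) ≤ min(|A|, |B|) / max(|A|, |B|), where |A| is the number of set bits: any database fingerprint whose bound falls below the threshold can be skipped without computing the full similarity. The function names and the toy database are hypothetical.

```python
from typing import List, Tuple


def popcount(x: int) -> int:
    return bin(x).count("1")


def tanimoto(x: int, y: int) -> float:
    union = popcount(x | y)
    return popcount(x & y) / union if union else 0.0


def pruned_search(query: int, database: List[int],
                  threshold: float) -> Tuple[List[Tuple[int, float]], int]:
    """Return (hits, number of full comparisons) for a fixed threshold."""
    q = popcount(query)
    hits, compared = [], 0
    for index, fp in enumerate(database):
        n = popcount(fp)
        low, high = min(q, n), max(q, n)
        # Upper bound on Tanimoto from bit counts alone; skip if too small.
        if high == 0 or low / high < threshold:
            continue
        compared += 1
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((index, score))
    return hits, compared


database = [0b1111, 0b0001, 0b11111111, 0b1110]
hits, compared = pruned_search(0b1111, database, threshold=0.7)
print(hits, compared)  # [(0, 1.0), (3, 0.75)] 2
```

In this toy run only two of the four fingerprints need a full comparison. A real system would typically pre-sort or bin the database by bit count so that pruned entries are never visited at all.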

19
(No Transcript)
20
Speedup from Pruning
  • Speedup depends on
      • Threshold
      • Query
      • Fingerprint length
      • Database size

21
(No Transcript)
22
(No Transcript)
23
Bias in Query Distribution
24
(No Transcript)
25
(No Transcript)
26
Aggregate Queries (Profiles)
27
Two Basic Strategies
  • Similar to bioinformatics
  • Aggregate individual pairwise measures
  • Build a fingerprint profile
  • Linear approaches
  • Non-linear approaches (consensus, modal, etc)
  • Hybrid (profile aggregation/scaling)
  • Profile-profile
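
The first strategy, aggregating individual pairwise measures, can be sketched as scoring a candidate against every molecule in the query set and then combining the scores. The choice of Tanimoto as the pairwise measure and of max and mean as the aggregation rules are assumptions for illustration; the slides do not say which rules were used.

```python
from typing import List


def tanimoto(x: int, y: int) -> float:
    union = bin(x | y).count("1")
    return bin(x & y).count("1") / union if union else 0.0


def aggregate_score(queries: List[int], candidate: int, how: str = "max") -> float:
    """Combine pairwise similarities between a candidate and a set of queries."""
    scores = [tanimoto(q, candidate) for q in queries]
    if how == "max":
        return max(scores)
    if how == "mean":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown aggregation rule: {how}")


queries = [0b1100, 0b0110]   # hypothetical query fingerprints
candidate = 0b0110
print(aggregate_score(queries, candidate, "max"))   # best single match: 1.0
print(aggregate_score(queries, candidate, "mean"))  # average over the query set
```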

28
Aggregations
29
Consensus Fingerprints
  • Create consensus fingerprint
  • Search database using the consensus



30
Locality-Sensitive Hashing
  • Bin fingerprints based on projections onto
    randomly directed vectors
  • log D random vectors → O(log D)
  • Search for neighbors by returning the bin
    corresponding to the query's projection
  • Has been used for clustering. May be useful for
    building diverse data sets. Not yet developed for
    searching
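
A minimal sketch of binning by random projections, assuming the sign of each projection contributes one bit to the bin key and that about log2(D) random vectors are used for a database of D fingerprints. The Gaussian directions, the function names, and the toy data are assumptions. Note that sign-based random projections are usually associated with cosine similarity rather than Tanimoto.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List


def make_hyperplanes(k: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw k random direction vectors of length n_bits."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_bits)] for _ in range(k)]


def signature(fp: int, planes: List[List[float]]) -> int:
    """Bin key: one bit per random vector, set when the projection is >= 0."""
    key = 0
    for plane in planes:
        projection = sum(w for i, w in enumerate(plane) if (fp >> i) & 1)
        key = (key << 1) | (1 if projection >= 0 else 0)
    return key


def build_index(fps: List[int], planes: List[List[float]]) -> Dict[int, List[int]]:
    bins: Dict[int, List[int]] = defaultdict(list)
    for index, fp in enumerate(fps):
        bins[signature(fp, planes)].append(index)
    return bins


def query(fp: int, bins: Dict[int, List[int]], planes: List[List[float]]) -> List[int]:
    """Return the indices stored in the bin matching the query's projection."""
    return bins.get(signature(fp, planes), [])


fps = [0b1111000011110000, 0b1111000011110001,
       0b0000111100001111, 0b1010101010101010]
k = max(1, math.ceil(math.log2(len(fps))))  # about log D random vectors
planes = make_hyperplanes(k, n_bits=16)
bins = build_index(fps, planes)
print(query(fps[0], bins, planes))
```

A lookup touches a single bin instead of scanning the whole database, at the price of possibly missing neighbors that fell into other bins.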

31
Outline
  • Historical Progression
  • Similarity Measures
  • Fingerprint Construction
  • Pathological Cases
  • MinMax - Counts
  • Pruning Search Space
  • Aggregate Queries
  • LSH