4. Molecular Similarity

Slides: 32

Provided by: Sho57

Learn more at: https://ics.uci.edu

Transcript and Presenter's Notes

Title: 4. Molecular Similarity

1
4. Molecular Similarity
2
Similarity and Searching

Historical Progression
Similarity Measures
Fingerprint Construction
Pathological Cases
MinMax and Counts
Pruning Search Space
Aggregate Queries
LSH

3
Historical Progression

Maximum Common Subgraph-Isomorphism (MCS)
maximum common substructure between two molecules
NP-complete
Structural Keys
dictionary of predetermined, domain-specific
sub-structures keyed to particular positions in a
bit-vector constructed for each molecule
similarity computed between bit-vectors (fast
O(D) scan)
2D Compressed Fingerprints
ALL substructures stored in a bit-vector using a
hashing scheme plus lossy compression (modulo
operator)
Similarity computed between bit-vectors or count
vectors
Faster Searches
database pruning
locality sensitive hashing (LSH) towards O(log
n) similarity searching

4
Superstructure and Substructure Searches
B
A

A is a superstructure of B (ignoring H)
B is a substructure of A
Tversky similarity

5
The Similarity Problem

How similar?

6
Spectral Similarity

Count substructures
Compare the count/bit vectors

7
2D Graph Substructures

For chemical compounds
atom/node labels
A = {C, N, O, H, ...}
bond/edge labels
B = {s, d, t, ar, ...}
Trace ALL Paths
O(N d^l) paths for N atoms, maximum degree d, and path length l
Cycles and trees
Combinatorial Space

(CsNsCdO)
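A minimal sketch of tracing all labeled paths, assuming a toy adjacency-list molecule. The names `toy_atoms`, `toy_bonds`, and `enumerate_paths` are hypothetical and not from the slides. Each path is written as alternating atom and bond labels, which reproduces the example path CsNsCdO.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy molecule C-N-C=O: atom index -> element label.
toy_atoms: Dict[int, str] = {0: "C", 1: "N", 2: "C", 3: "O"}

# Adjacency list: atom index -> [(neighbor index, bond label)].
# Bond labels follow the slide: s (single), d (double), t (triple), ar (aromatic).
toy_bonds: Dict[int, List[Tuple[int, str]]] = {
    0: [(1, "s")],
    1: [(0, "s"), (2, "s")],
    2: [(1, "s"), (3, "d")],
    3: [(2, "d")],
}


def enumerate_paths(atoms, bonds, max_bonds: int) -> Set[str]:
    """Return every labeled simple path with at most max_bonds bonds."""
    paths: Set[str] = set()

    def extend(node: int, visited: frozenset, label: str, depth: int) -> None:
        paths.add(label)
        if depth == max_bonds:
            return
        for neighbor, bond in bonds.get(node, []):
            if neighbor not in visited:  # simple paths only: no atom revisited
                extend(neighbor, visited | {neighbor},
                       label + bond + atoms[neighbor], depth + 1)

    # Start a depth-first walk from every atom.
    for start in atoms:
        extend(start, frozenset({start}), atoms[start], 0)
    return paths


paths = enumerate_paths(toy_atoms, toy_bonds, max_bonds=3)
```

This sketch emits each path in both directions (CsNsCdO and OdCsNsC) and does not canonicalize them or handle rings and trees specially.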
8
Mapping Structures to Bits

Compact data representation
Hash each path to a bit vector: Feature space → Bit space
Resolve clashes with the OR operator (i.e., 1 OR 1 = 1)
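A minimal sketch of the mapping, assuming paths are already available as strings. The function name `fold_paths`, the MD5 hash, and the 64-bit length are illustrative choices, not the scheme used in the slides. The modulo step is the lossy compression, and OR-ing means two paths that collide simply share a bit.

```python
import hashlib
from typing import Iterable


def fold_paths(paths: Iterable[str], n_bits: int = 64) -> int:
    """Hash each path string to a bit position and OR it into a fingerprint.

    The fingerprint is returned as a Python int used as a bit vector.
    """
    fingerprint = 0
    for path in paths:
        digest = hashlib.md5(path.encode("utf-8")).digest()
        # Lossy compression: reduce the large hash value modulo the vector length.
        index = int.from_bytes(digest[:8], "big") % n_bits
        # Clashes are resolved with OR: a bit that is already set stays set.
        fingerprint |= 1 << index
    return fingerprint


fp = fold_paths(["C", "CsN", "CsNsC", "CsNsCdO"], n_bits=64)
```

Because of the OR, the fingerprint of a set of paths equals the OR of the fingerprints of its subsets, and a repeated path changes nothing.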

9
Similarity Measures

There are many ways of measuring similarity (or
distance) between bit/count vectors
Euclidean
Cosine
Exponentials
Tanimoto/Jaccard
Tversky
MinMax
And many more (L1, L2, Lp, Hamming, Manhattan, ...)

10
(No Transcript)
11
(No Transcript)
12
Similarity Measures: Tanimoto

Tally features
Unique (a,b)
Both on (c)
Both off (d)
Similarity Formula
Tanimoto = c / (a + b + c)
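A minimal sketch of the tally and the Tanimoto formula c / (a + b + c), assuming fingerprints are stored as Python ints used as bit vectors. The helper names are hypothetical. Returning 1.0 for two empty fingerprints is a convention chosen here, not something the slides specify.

```python
from typing import Tuple


def popcount(x: int) -> int:
    """Number of bits set in a non-negative int."""
    return bin(x).count("1")


def tally(fp_a: int, fp_b: int) -> Tuple[int, int, int]:
    """Return (a, b, c): bits unique to A, bits unique to B, bits on in both."""
    c = popcount(fp_a & fp_b)
    a = popcount(fp_a) - c
    b = popcount(fp_b) - c
    return a, b, c


def tanimoto(fp_a: int, fp_b: int) -> float:
    a, b, c = tally(fp_a, fp_b)
    union = a + b + c
    # Convention assumed here: two empty fingerprints count as identical.
    return c / union if union else 1.0


score = tanimoto(0b1110, 0b0111)  # a = 1, b = 1, c = 2, so 2 / 4 = 0.5
```

The "both off" count d does not enter the formula, so adding unused bit positions leaves the score unchanged.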

13
The Fingerprint Approximation

Fingerprint bit similarity approximates chemical
feature similarity.

14
Similarity Measures: Tversky

Tally features
Unique (a,b)
Both on (c)
Both off (d)
Similarity Formula
Tanimoto = c / (a + b + c)
Tversky(α, β) = c / (αa + βb + c)
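A minimal sketch of the Tversky measure c / (αa + βb + c), again assuming int-encoded bit vectors and hypothetical helper names. With α = β = 1 it reduces to Tanimoto. With α = 1, β = 0 it scores 1.0 whenever every bit of A is also set in B, which is why it suits substructure-style queries.

```python
def popcount(x: int) -> int:
    """Number of bits set in a non-negative int."""
    return bin(x).count("1")


def tversky(fp_a: int, fp_b: int, alpha: float, beta: float) -> float:
    c = popcount(fp_a & fp_b)   # both on
    a = popcount(fp_a) - c      # on only in A
    b = popcount(fp_b) - c      # on only in B
    denom = alpha * a + beta * b + c
    # Convention assumed here: an empty denominator scores 0.0.
    return c / denom if denom else 0.0


symmetric = tversky(0b1110, 0b0111, 1.0, 1.0)   # same as Tanimoto: 0.5
contained = tversky(0b0110, 0b1110, 1.0, 0.0)   # A's bits all appear in B: 1.0
```

Unequal weights make the measure asymmetric, so swapping the two fingerprints generally changes the score.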

15
Pathological Cases

On the Properties of Bit String-Based Measures of
Chemical Similarity. Flower DR, J. Chem. Inf.
Comput. Sci. 1998, 38, 379-386

16
Pathological Cases

Issue of labeling scheme.

17
Counts

MinMax similarity is a generalization of Tanimoto
which uses the counts.
MinMax can work better than Tanimoto.
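A minimal sketch of MinMax on count vectors, assuming the usual definition: the sum of element-wise minima divided by the sum of element-wise maxima. The function name and the list-of-ints representation are illustrative. On 0/1 vectors the result coincides with Tanimoto.

```python
from typing import Sequence


def minmax(counts_a: Sequence[int], counts_b: Sequence[int]) -> float:
    """MinMax similarity between two equal-length vectors of non-negative counts."""
    if len(counts_a) != len(counts_b):
        raise ValueError("count vectors must have the same length")
    numerator = sum(min(x, y) for x, y in zip(counts_a, counts_b))
    denominator = sum(max(x, y) for x, y in zip(counts_a, counts_b))
    # Convention assumed here: two all-zero vectors count as identical.
    return numerator / denominator if denominator else 1.0


with_counts = minmax([2, 0, 1], [1, 1, 1])        # (1 + 0 + 1) / (2 + 1 + 1) = 0.5
binary_case = minmax([1, 1, 1, 0], [0, 1, 1, 1])  # equals Tanimoto: 2 / 4 = 0.5
```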

18
Pruning Search Space Using Bounds

Linear speedup (search C × D) for fixed threshold,
often by one order of magnitude or more.
Sub-linear speedup (search C × D^0.6) for top K.
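The slide does not say which bounds are used. One well-known bound of this kind is that the Tanimoto score of two fingerprints with A and B bits set can never exceed min(A, B) / max(A, B). The sketch below, with hypothetical names, uses it to skip full comparisons for a fixed threshold.

```python
from typing import List, Tuple


def popcount(x: int) -> int:
    return bin(x).count("1")


def tanimoto(fp_a: int, fp_b: int) -> float:
    union = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / union if union else 1.0


def tanimoto_upper_bound(n_a: int, n_b: int) -> float:
    """Best possible Tanimoto score given only the two bit counts."""
    larger = max(n_a, n_b)
    return min(n_a, n_b) / larger if larger else 1.0


def pruned_search(query: int, database: List[int],
                  threshold: float) -> Tuple[List[Tuple[int, float]], int]:
    """Return (hits, number of full comparisons actually performed)."""
    n_query = popcount(query)
    hits: List[Tuple[int, float]] = []
    compared = 0
    for fp in database:
        # If even the upper bound misses the threshold, skip the full comparison.
        if tanimoto_upper_bound(n_query, popcount(fp)) < threshold:
            continue
        compared += 1
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((fp, score))
    return hits, compared


database = [0b1, 0b1111, 0b0111, 0b11111111]
hits, compared = pruned_search(0b1110, database, threshold=0.5)
```

This sketch recomputes the bit count of every entry. A real index would presumably store the counts and sort by them, so that whole ranges of the database can be skipped without being visited.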

19
(No Transcript)
20
Speedup from Pruning

Speedup depends on
Threshold
Query
Fingerprint length
Database size

21
(No Transcript)
22
(No Transcript)
23
Bias in Query Distribution
24
(No Transcript)
25
(No Transcript)
26
Aggregate Queries (Profiles)
27
Two Basic Strategies

Similar to bioinformatics
Aggregate individual pairwise measures
Build a fingerprint profile
Linear approaches
Non-linear approaches (consensus, modal, etc)
Hybrid (profile aggregation/scaling)
Profile-profile

28
Aggregations
29
Consensus Fingerprints

Create consensus fingerprint
Search database using the consensus
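The slide does not define how the consensus is formed. The sketch below assumes one simple rule: a bit is kept when it is set in at least a given fraction of the query molecules. The names `consensus_fingerprint` and `min_fraction` are hypothetical.

```python
from typing import Sequence


def consensus_fingerprint(fingerprints: Sequence[int], n_bits: int,
                          min_fraction: float = 0.5) -> int:
    """Keep each bit that is set in at least min_fraction of the fingerprints."""
    if not fingerprints:
        raise ValueError("need at least one fingerprint")
    needed = min_fraction * len(fingerprints)
    consensus = 0
    for bit in range(n_bits):
        count = sum((fp >> bit) & 1 for fp in fingerprints)
        if count >= needed:
            consensus |= 1 << bit
    return consensus


actives = [0b110, 0b100, 0b101]
profile = consensus_fingerprint(actives, n_bits=3)  # only the top bit is shared
```

The resulting fingerprint can then be used as an ordinary query against the database with any of the bit-vector similarity measures.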

30
Locality-Sensitive Hashing

Bin fingerprints based on projections onto
randomly directed vectors
log D random vectors → O(log D)
Search for neighbors by returning the bin
corresponding to the query's projection
Has been used for clustering. May be useful for
building diverse data sets. Not yet developed for
searching
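A minimal sketch of binning by random projections, assuming fingerprints are lists of 0/1 values and that a bin is identified by the signs of the projections onto a handful of Gaussian random vectors. All names, the seed, and the number of vectors are illustrative assumptions, not details from the slides.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Key = Tuple[bool, ...]


def make_projections(n_bits: int, n_vectors: int, seed: int = 0) -> List[List[float]]:
    """Draw n_vectors randomly directed vectors of dimension n_bits."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_bits)] for _ in range(n_vectors)]


def lsh_key(bits: Sequence[int], projections: List[List[float]]) -> Key:
    """One sign per random vector; the tuple of signs names the bin."""
    return tuple(sum(w * x for w, x in zip(vec, bits)) >= 0.0 for vec in projections)


def build_index(fingerprints: List[Sequence[int]],
                projections: List[List[float]]) -> Dict[Key, List[int]]:
    index: Dict[Key, List[int]] = defaultdict(list)
    for i, fp in enumerate(fingerprints):
        index[lsh_key(fp, projections)].append(i)
    return index


def query_bin(index: Dict[Key, List[int]], bits: Sequence[int],
              projections: List[List[float]]) -> List[int]:
    """Return the indices stored in the bin that the query projects into."""
    return index.get(lsh_key(bits, projections), [])


projections = make_projections(n_bits=8, n_vectors=3, seed=1)
fps = [[1, 0, 1, 1, 0, 0, 1, 0], [0, 1, 0, 0, 1, 1, 0, 1]]
index = build_index(fps, projections)
candidates = query_bin(index, fps[0], projections)
```

Returning a single bin is approximate: a true neighbor can fall on the other side of one random hyperplane and be missed.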

31
Outline