Compr

Version 5.0.6

Introduction

Compr compares two sets of objects (like compound libraries) using diversity and dissimilarity calculations.

This document mentions molecules as the entities to be compared, but the software can also be used for other types of objects.

The algorithm uses nearest-neighbor searching to find molecules similar to the query object. The calculation is based on the Tanimoto (or Jaccard) coefficient, which is calculated by the following formula in the case of binary fingerprint (bit string) input:

T(A,B) = N_{A&B} / (N_A + N_B - N_{A&B})

where N_A and N_B are the numbers of bits set in the bit strings of molecules A and B, respectively, and N_{A&B} is the number of bits that are set in both.
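
As an illustration only, the formula above can be sketched in Python with each fingerprint held as a single integer whose bits are the fingerprint bits. The function name tanimoto and the handling of two empty fingerprints are assumptions made for this sketch, not part of Compr.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit strings stored as Python ints."""
    n_a = bin(fp_a).count("1")           # bits set in A
    n_b = bin(fp_b).count("1")           # bits set in B
    n_ab = bin(fp_a & fp_b).count("1")   # bits set in both A and B
    denom = n_a + n_b - n_ab
    # Assumption of this sketch: two empty fingerprints count as identical.
    return 1.0 if denom == 0 else n_ab / denom


# Two 8-bit fingerprints with 4 bits set each and 3 bits in common:
# T = 3 / (4 + 4 - 3)
print(tanimoto(0b10110100, 0b10010110))
```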

When only binary fingerprints are used to calculate the dissimilarity between molecules, the dissimilarity of molecules A and B is

D(A,B) = 1-T(A,B)

where T(A,B) is the Tanimoto coefficient of molecules A and B.

When other columns are also used, a weighted Euclidean distance calculation is applied:

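D(A,B) = sqrt{[1-T(A,B)] + w_1 [C_1(A)-C_1(B)]^2 + w_2 [C_2(A)-C_2(B)]^2 + ...}

where w_1, w_2, ... are the weights of the other columns (see --weights) and C_i(A), C_i(B) are the values of the i-th other column for molecules A and B.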
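
The weighted distance can be sketched as follows. This is a minimal illustration that mirrors the formula exactly as it is printed in this document; the function names and argument layout are hypothetical and are not Compr's API.

```python
import math


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit strings stored as Python ints."""
    n_ab = bin(fp_a & fp_b).count("1")
    denom = bin(fp_a).count("1") + bin(fp_b).count("1") - n_ab
    return 1.0 if denom == 0 else n_ab / denom


def dissimilarity(fp_a, fp_b, cols_a=(), cols_b=(), weights=()):
    """Dissimilarity of A and B as written in the formulas of this document."""
    d_fp = 1.0 - tanimoto(fp_a, fp_b)
    if not cols_a:
        # Only binary fingerprints are used: D = 1 - T
        return d_fp
    total = d_fp
    for w, c_a, c_b in zip(weights, cols_a, cols_b):
        total += w * (c_a - c_b) ** 2   # weighted squared column difference
    return math.sqrt(total)


# Fingerprints only, then fingerprints plus one extra column with weight 0.5
print(dissimilarity(0b1100, 0b1010))
print(dissimilarity(0b1100, 0b1010, cols_a=(1.0,), cols_b=(3.0,), weights=(0.5,)))
```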

Instead of the brute-force method, Compr applies heuristics to avoid calculating every pairwise dissimilarity and comparing every neighbor list.

 

Usage

    compr [<options>]

Prepare the usage of the compr script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.

Or call the Compare class directly:

Win32 / Java 2 (assuming that JChem is installed in c:\jchem):

    java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" chemaxon.clustering.Compare [<options>]

Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):

    java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
    chemaxon.clustering.Compare [<options>]

Because the utility has many parameters, it may be reasonable to create a shell script or a batch file for calling the software.

Options

General options:
  -h  --help                    this help message
  -d  --driver <JDBC driver>    JDBC driver
  -u  --dburl <url>             URL of database
  -l  --login <login>           login name
  -p  --password <password>     password
  -P  --proptable <tablename>   name of property table
  -s  --saveconf                save settings into ~/.jchem

Input options (default: standard input):
  -i  --input <path1> <path2>   input files to compare (text file input)
  -q  --query <sql1> <sql2>     SQL query strings for reading input
                                (database input)
Combined input (table vs. file comparison):
  -i <path> -q <sql>

Output options (default: standard output):
  -o  --output <filepath>       output file path (text file output)
  -a  --statement <sql>         SQL statement for inserting results
                                (database output)
  -z  --stat                    print statistics on overall dissimilarity
  -Z  --only-stat               statistics only on overall dissimilarity
  -y  --only-dissimilar         print only dissimilar objects
  -L  --list-similar            list similar objects from the first set
  -v  --verbose                 verbose output

Data properties
  -m  --dimensions <dim>        number of floating-point descriptors
  -f  --fingerprint-size <bits> binary fingerprint size in bits.
                                fpsize should be a multiple of 32
  -w  --weights <w1> <w2> ...   the weights of the floating-point descriptors
  -g  --generate-id             generate id for each compound.

Conditions of calculation
  -t  --threshold <threshold>   maximum dissimilarity of two compounds
                                (default: 1)
  -D  --different-ids           don't compare compounds with the same id
  -c  --max-count <count>       maximum number of similar objects listed
  -r  --order                   sort similar objects by distance (closest first)

Warning! Without a valid license key, the software runs in demo mode and at most 1000 structures can be retrieved from the database.

 

Loading/Saving of Settings

It would be inconvenient to enter all of the parameters of the compr script at each run. To overcome this problem, it is possible to save some of the settings that are not changed frequently in the .jchem file stored in the user's home directory. Use the --saveconf option to store the following settings:

The settings needed for the database connection are also modified and saved by JChemManager. If you have successfully connected to the database using JChemManager, you do not need to set the connection for Compr manually.

 

Database Connections

For more information on setting connection parameters, please visit the Administration Guide of JChem.

 

Input

Compr reads two data sources, which contain the data of the molecule libraries to be compared. The software may import data from either text files (--input) or database tables (--query). Each input data source must contain the following columns:

Columns              Type                     Content
Id                   Integer numbers          Id of compounds. (Optional in text files)
fp1, fp2, fp3 ...    Integer numbers          Binary fingerprints in integer number blocks. The number of fp. columns is fp. length / 32. (Optional)
d1, d2, d3, ...      Floating point numbers   Other descriptors. (Optional)
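
To make the relation between the fingerprint length and the number of fp columns concrete, here is a small sketch that packs a set of bit positions into 32-bit integer blocks. The function names and the bit ordering inside a block are assumptions of this sketch. Real fingerprints are produced by JChem tools, and their internal bit layout is not described here. The values are converted to the signed 32-bit range because Java integers are signed.

```python
def to_blocks(bits_set, fp_size):
    """Pack bit positions into fp_size / 32 signed 32-bit integers."""
    if fp_size % 32 != 0:
        raise ValueError("fingerprint size must be a multiple of 32")
    blocks = [0] * (fp_size // 32)
    for pos in bits_set:
        # Assumed layout: bit `pos` lives in block pos // 32 at offset pos % 32
        blocks[pos // 32] |= 1 << (pos % 32)
    # Map to the signed range of a 32-bit integer
    return [b - (1 << 32) if b >= (1 << 31) else b for b in blocks]


def popcount(blocks):
    """Number of bits set across all blocks; the mask undoes the sign."""
    return sum(bin(b & 0xFFFFFFFF).count("1") for b in blocks)


blocks = to_blocks([0, 31, 32, 100], fp_size=128)
print(len(blocks), blocks, popcount(blocks))
```

A 512-bit fingerprint, as used in the examples of this document, gives 512 / 32 = 16 blocks under this scheme, which matches the sixteen cd_fp columns selected there.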

Comments:

 

Output

The software can write the results of the calculation into either a text file (--output) or a database table (--statement).

Each row of the output belongs to a compound in the second set. The following symbols are used in the description of columns:

L1, L2     The compound libraries, in the order they are specified in the call of Compr
C          The compound from L2 that belongs to the specific row
D(C,Ai)    The dissimilarity between C and compound i from L1

The exported data contains a subset of the following columns:

Columns                    Type                                 Content
Id                         Integer numbers                      The identifier of C
minD                       Floating point numbers               The dissimilarity of C and its nearest neighbor from L1: min(D(C,Ai))
nneib                      Integer numbers                      The nearest neighbor of C in L1
simcnt                     Integer numbers                      The number of objects in L1 that are similar to C. (Their dissimilarity is lower than or equal to the specified threshold value.)
avgD                       Floating point numbers               The average dissimilarity of C and all compounds from L1: sum(D(C,Ai))/N(L1)
maxD                       Floating point numbers               The maximum dissimilarity between C and all compounds from L1: max(D(C,Ai))
list_of_similar_objects    Integer numbers in several columns   The list of molecules with a dissimilarity below the threshold (neighbors) in one row.

Only the column "Id" is printed if the --different-ids option is specified. The "avgD" and "maxD" columns are written only if either the --stat (-z) or the --only-stat (-Z) option is given. The "list_of_similar_objects" columns are printed if the --list-similar option is specified.
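
The meaning of the output columns can be checked against a brute-force reference. The sketch below is not how Compr works internally (Compr uses heuristics); it only restates the column definitions for fingerprint-only input. The function compare, the dictionary input, the tie-breaking by id, and the choice to average over the compounds actually compared when ids are skipped are all assumptions of this sketch.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit strings stored as Python ints."""
    n_ab = bin(fp_a & fp_b).count("1")
    denom = bin(fp_a).count("1") + bin(fp_b).count("1") - n_ab
    return 1.0 if denom == 0 else n_ab / denom


def compare(lib1, lib2, threshold=1.0, different_ids=False):
    """Return one row per compound C of lib2 (dicts map id -> fingerprint)."""
    rows = []
    for cid, cfp in lib2.items():
        # D(C, Ai) for every compound Ai of the first library
        dists = [(1.0 - tanimoto(cfp, afp), aid)
                 for aid, afp in lib1.items()
                 if not (different_ids and aid == cid)]
        if not dists:
            continue
        min_d, nneib = min(dists)          # ties resolved by the smaller id
        values = [d for d, _ in dists]
        rows.append({
            "Id": cid,
            "minD": min_d,
            "nneib": nneib,
            "simcnt": sum(d <= threshold for d in values),
            "avgD": sum(values) / len(values),
            "maxD": max(values),
        })
    return rows


lib1 = {1: 0b1111, 2: 0b1100}
lib2 = {10: 0b1110, 2: 0b1100}
for row in compare(lib1, lib2, threshold=0.3):
    print(row)
```

Passing different_ids=True in this sketch skips the pair in which compound 2 would be compared with itself, which is the situation the --different-ids option addresses.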

Comments for database output:

 

Diversity Statistics

Optionally, Compr can print diversity statistics to the standard output or to the given output file. Statistics printing is enabled by the --stat (-z) or the --only-stat (-Z) option. (The latter suppresses the information on individual compounds.) The following data will be printed:

The calculation is significantly slower if statistics are enabled, since all pairwise dissimilarity values have to be calculated. (Heuristics cannot be applied.)

Comments on Some Parameters

--fingerprint-size
The number of binary fingerprint columns multiplied by 32 (because the bit-length of integer numbers is 32 in Java)
--dimensions
Specifies the number of other columns. If only binary fingerprints are used in the calculation, then this parameter doesn't have to be set.
--weights
When other columns are used, a weighted Euclidean distance calculation may be applied. If there are also binary fingerprint columns, weights are relative to the Tanimoto coefficient calculated from the binary fingerprints (the Tanimoto coefficient has a weight of 1.0).
--threshold
Compounds with a dissimilarity below the threshold will be considered similar.
--different-ids
Compounds with the same id will not be compared. This is useful in either of the following cases:
  • The two compound sets are not disjoint (some of the compounds are the same).
  • Self-dissimilarity is tested, that is, the two compound sets are the same.
This option requires that identical compounds have identical id values in both sets.

By default, the heap size in some Java runtime environments is limited to 64MB, so you may run out of memory easily. See the FAQ on increasing the heap size.

 

Examples

The examples assume that, where needed, all connection parameters have been set and stored by JChemManager (or by a previous save from Compr).

  1. A batch file (Windows) for reading from a database and writing the id of all compounds in the commercial_library table that are similar to the structures in the home_library table to the standard output:
    set QUERY1="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM home_library"
    set QUERY2="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM commercial_library"
    compr -q %QUERY1% %QUERY2% -t 0.1 -f 512 -D
    
  2. A UNIX shell script for reading binary fingerprints from two database tables (home_library and commercial_library) and inserting the results into another table. It collects the ids and calculates dissimilarity information for the structures in commercial_library that are similar to some structures in home_library:
    QUERY1="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM home_library"
    QUERY2="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM commercial_library"
    INSERT="INSERT INTO simil(cd_id,  minD, nneib, simcnt) VALUES(?,?,?,?)"
    compr -q "$QUERY1" "$QUERY2" -a "$INSERT" -t 0.1 -f 512
    

    Make sure that the simil table exists and is empty before running the script.

  3. Full statistics calculation using the output of GenerateMD (generated id values are needed):
    generatemd c home_library.smi -k CF -c cfp.xml -D -o home_library.fp
    generatemd c commercial_library.smi -k CF -c cfp.xml -D -o commercial_library.fp
    compr -f 512 -t 0.1 -g -z -i home_library.fp commercial_library.fp >compr_result.tab
    
    Example for the result file:
    id	minD	nneib	simcnt	avgD	maxD
    1	0.165	40092	0	0.531	0.770
    2	0.056	8998	1	0.546	0.766
    ....
    999	0.095	365	1	0.521	0.804
    1000	0.103	221	0	0.527	0.821
    
    STATISTICS
    
    Number of objects in set 1 = 90000
    Number of objects in set 2 = 1000
    Minimum dissimilarity between sets = 0.0057803392
    Average dissimilarity between sets = 0.543819
    Maximum dissimilarity between sets = 0.82962966
    
  4. Self-dissimilarity test of the home library:
    compr -f 512 -t 0.1 -g -z -i home_library.fp home_library.fp >compr_result.tab
    
  5. Displaying the structures and the results of a full statistics calculation using the CreateView and MarvinView applications:
    • Creating an SDfile containing the calculated data (the minD, avgD, and simcnt columns) and the structures:
      crview -i id -d "minD:avgD:simcnt" -s commercial_library.sdf -t compr_result.tab > compr_result.sdf
      
    • Displaying the structures and the following data: CAS_RN (comes from the original SDfile), minD, avgD, and simcnt:
      mview  -r 3 -f "CAS_RN:minD:avgD:simcnt" compr_result.sdf
      
    [Screenshot: the results shown by MarvinView]

 
Copyright © 1999-2008 ChemAxon Ltd.    All rights reserved.