Ward

Version 5.0.6

Introduction

The Ward application uses Ward's minimum variance method for clustering molecules based on molecular fingerprints or other descriptors. Murtagh's reciprocal nearest neighbor (RNN) algorithm is applied as a heuristic to achieve fast calculation times.

 

Usage

    ward [<options>]

Set up the ward script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.

Or call the Ward class directly:

Win32 / Java 2 (assuming that JChem is installed in c:\jchem):

    java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" chemaxon.clustering.Ward [<options>]

Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):

    java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
    chemaxon.clustering.Ward [<options>]

Because the utility has many parameters, it is often convenient to wrap the call in a shell script or a batch file.
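A minimal sketch of such a wrapper for Unix shells is shown below. The function name ward_wrapper and the JCHEM_HOME variable are illustrative assumptions, not part of JChem; the install path defaults to the /usr/local/jchem location assumed above.

```shell
#!/bin/sh
# Hypothetical wrapper around the Ward class (names are illustrative).
# JCHEM_HOME is an assumed variable; it defaults to the install path used above.
JCHEM_HOME="${JCHEM_HOME:-/usr/local/jchem}"

ward_wrapper() {
    # Put jchem.jar first and append CLASSPATH only when it is set and non-empty,
    # then pass every argument through to Ward unchanged.
    java -cp "$JCHEM_HOME/lib/jchem.jar${CLASSPATH:+:$CLASSPATH}" \
        chemaxon.clustering.Ward "$@"
}
```

Once the function is defined, a call such as ward_wrapper -f 512 -c 100 <fingerprints.txt runs the same java command as the direct call shown above.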

Options

General options:
  -h  --help                    this help message
  -d  --driver <JDBC driver>    JDBC driver
  -u  --dburl <url>             URL of database
  -l  --login <login>           login name
  -p  --password <password>     password
  -P  --proptable <tablename>   name of property table
  -s  --saveconf                save settings into ~/.jchem

Input options (default: standard input):
  -i  --input <filepath>        input file path (text file input)
  -q  --query <sql>             SQL query for reading input
                                (database input)

Output options (default: standard output):
  -o  --output <filepath>       output file path (text file output)
  -a  --statement <sql>         SQL statement for inserting results
                                (database output)
  -x  --central                 calculate and sign central objects
  -y  --singlet                 singletons get negative cluster ids
  -z  --statistics              print statistics
  -Z  --only-statistics         print only statistics
  -K  --Kelley <filepath>       print Kelley statistics into text file
  -v  --verbose                 verbose output

Data properties
  -m  --dimensions <dim>        number of floating-point descriptors
  -f  --fingerprint-size <bits> binary fingerprint size in bits
                                fpsize should be a multiple of 32
  -w  --weights <w1> <w2> ...   the weights of the floating-point descriptors
  -g  --generate-id             generate id for each compound

Clustering parameters
  -c  --cluster-count <count>   number of clusters to be generated
  -C  --only-clustering         clusters are generated using input RNN list
  If --cluster-count is not set, then RNN list is generated on output.

Warning! Without a valid license key, the software runs in demo mode and at most 1000 structures can be retrieved from the database.

 

Loading/Saving of Settings

It would be inconvenient to enter all of the parameters of the ward script at each run. To overcome this problem, it is possible to save some of the settings that are not changed frequently in the .jchem file stored in the user's home directory. Use the --saveconf option to store the following settings:

The settings needed for the database connection are also modified and saved by JChemManager. If you have already connected to the database successfully using JChemManager, you do not need to set the connection parameters for Ward manually.

 

Database Connections

For more information on setting connection parameters, please see the Administration Guide of JChem.

 

Input

The software may import data from either a text file (--input) or a database (--query). The input data must contain the following columns:

Columns             Type                    Content
Id                  Integer numbers         Id of compounds
                                            (optional in text files)
fp1, fp2, fp3, ...  Integer numbers         Fingerprints in integer number blocks;
                                            the number of fp columns is
                                            fingerprint length / 32 (optional)
d1, d2, d3, ...     Floating point numbers  Other descriptors (optional)
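The relation between fingerprint length and column count can be checked with a few lines of shell. The sketch below is illustrative only: the helper names fp_columns and fp_column_list are hypothetical, and the cd_fp column prefix is an assumption taken from the queries in the Examples section.

```shell
#!/bin/sh
# Hypothetical helpers (not part of JChem).

# Print the number of 32-bit integer columns needed for a fingerprint of
# the given bit length; fail if the length is not a positive multiple of 32.
fp_columns() {
    bits=$1
    if [ "$bits" -le 0 ] || [ $((bits % 32)) -ne 0 ]; then
        echo "fingerprint size must be a positive multiple of 32" >&2
        return 1
    fi
    echo $((bits / 32))
}

# Print a comma-separated column list such as "cd_fp1, cd_fp2, ..." for use
# in a SELECT statement. The default prefix cd_fp is an assumption taken
# from the example queries.
fp_column_list() {
    count=$(fp_columns "$1") || return 1
    prefix=${2:-cd_fp}
    list=""
    i=1
    while [ "$i" -le "$count" ]; do
        list="${list:+$list, }$prefix$i"
        i=$((i + 1))
    done
    echo "$list"
}
```

For a 512-bit fingerprint, fp_columns 512 prints 16, which matches the cd_fp1 to cd_fp16 columns selected in the example queries.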

Comments:

 

Output

The software can write the results of clustering into either a text file (--output) or a database table (--statement).

The exported data contains the following columns:

Columns  Type             Content
Id       Integer numbers  Identifier of compounds
Clid     Integer numbers  Cluster identifier
Centr    Integer numbers  Displays whether the object is central

The last column is written only if the --central option is specified. A central object has the smallest sum of dissimilarities to the other objects in the cluster. Central object calculation slows down the application significantly.
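In symbols, and assuming that d(i, j) denotes the dissimilarity used for clustering (the notation c_m and d is chosen here for illustration only), the central object of a cluster m can be written as:

```latex
\[
c_m = \operatorname*{arg\,min}_{i \in m} \sum_{j \in m,\; j \neq i} d(i, j)
\]
```

A direct evaluation of this expression needs every pairwise dissimilarity within the cluster, which is consistent with the slowdown noted above.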

Comments for text output:

Comments for database output:

 

Clustering Statistics

Optionally, Ward can print clustering statistics to the standard output or to the given output file. Statistics printing is enabled with the --statistics or --only-statistics option. (The latter suppresses the information on individual compounds.) The following data are printed:

The calculation is significantly slower if statistics are enabled, since all pairwise dissimilarity values have to be calculated. (Heuristics cannot be applied.)

 

Automatic Cluster Level Selection

Hierarchic clustering techniques, like the Ward method, can cluster the set at any chosen hierarchy level. However, in most cases, there is no obvious way to select the optimal number of clusters. Using the --Kelley <filepath> option, an optimized hierarchy level can be calculated using the Kelley method, and the resulting statistics are written into the specified file.

The Kelley measure balances the normalized "spread" of the clusters at a particular level with the number of clusters at that level. For a given cluster level l, it is defined as:
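Following the form published by Kelley et al. (reference 4), the measure can be written as shown below; the symbol Kelley_l is a label chosen here. The first term scales the average spread to the range 1 to n-1, and the second adds the number of clusters:

```latex
\[
\mathit{Kelley}_l = (n - 2)\,
  \frac{\mathit{AvSpr}_l - \min(\mathit{AvSpr})}
       {\max(\mathit{AvSpr}) - \min(\mathit{AvSpr})}
  + 1 + k_l
\]
```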

where n is the number of elements in all clusters, k_l is the number of clusters at level l, AvSpr_l is the average spread of the clusters at level l, and min(AvSpr) and max(AvSpr) are the minimum and maximum of this value across all cluster levels.

The spread of a cluster m is given by:
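Following Kelley et al. (reference 4), the spread is the average of all pairwise distances within the cluster; the symbol Spread_m is a label chosen here, and the expression is undefined for singletons (N = 1):

```latex
\[
\mathit{Spread}_m = \frac{\sum_{i<j} \mathit{dist}(i, j)}{N(N - 1)/2}
\]
```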

where N is the number of members in cluster m, i and j are members of cluster m, and dist(i,j) is the Euclidean distance between the two members i and j.

 

Comments on Some Parameters

--fingerprint-size
The number of binary fingerprint columns multiplied by 32 (because integers are 32 bits long in Java).
--dimensions
Specifies the number of other (floating-point descriptor) columns. If only binary fingerprints are used in the clustering process, this parameter does not have to be set.
--weights
When other columns are used, a weighted Euclidean distance calculation may be applied. If there are also binary fingerprint columns, weights are relative to the Tanimoto coefficient calculated from the binary fingerprints (the Tanimoto coefficient has a weight of 1.0).
--cluster-count
The desired number of clusters.

By default, the heap size in some Java runtime environments is limited to 64MB, so you may run out of memory easily. See the FAQ on increasing the heap size.
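As a hedged illustration, the maximum heap size of the standard Java launcher can be raised with the -Xmx option when calling the Ward class directly. The function name ward_bigheap, the WARD_HEAP and JCHEM_HOME variables, and the 512m default are assumptions chosen for this sketch; pick a value that fits the machine and the data set.

```shell
#!/bin/sh
# Hypothetical helper: run Ward with a larger maximum heap.
# WARD_HEAP and its 512m default are illustrative; -Xmx is the standard
# JVM option for the maximum heap size. JCHEM_HOME is an assumed variable
# that defaults to the Unix install path used in the Usage section.
ward_bigheap() {
    java "-Xmx${WARD_HEAP:-512m}" \
        -cp "${JCHEM_HOME:-/usr/local/jchem}/lib/jchem.jar${CLASSPATH:+:$CLASSPATH}" \
        chemaxon.clustering.Ward "$@"
}
```

Setting WARD_HEAP=1g before the call would request a one-gigabyte heap instead of the default used in this sketch.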

 

Running Reciprocal Nearest Neighbor Search and Clustering Separately

Setting the --cluster-count option correctly is important for fine-tuning the clustering process. Since the reciprocal nearest neighbor search is much more time-consuming than the clustering stage, it is reasonable to separate the two steps, so that clustering can be run several times with different --cluster-count settings.

If --cluster-count is not specified, Ward collects the list of RNN pairs and their distances and stores it in a text file. If this file is later fed into Ward, the RNN search is skipped. When creating the RNN list without clustering, the
 --central
 --statistics
 --only-statistics
options are not available.
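The two-stage workflow can be sketched as a small shell function. The function name cluster_levels and the output file naming are assumptions made for illustration; the options are those listed above, and the input is expected to be an RNN list produced by an earlier run without --cluster-count.

```shell
#!/bin/sh
# Hypothetical helper: reuse one RNN list for several cluster counts.
# Usage: cluster_levels <rnn-list-file> <count> [<count> ...]
cluster_levels() {
    rnn_list=$1
    shift
    for count in "$@"; do
        # -C: cluster from the input RNN list; -c: desired number of clusters.
        ward -C -c "$count" <"$rnn_list" >"clusters.$count.txt" || return 1
    done
}
```

For example, cluster_levels neighborlists.txt 10 50 100 writes clusters.10.txt, clusters.50.txt and clusters.100.txt without repeating the RNN search.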

If the --only-clustering option is specified for Ward, then

 

Examples

The examples assume that all connection parameters have been set and stored by JChemManager (or saved by a previous run of Ward).

  1. A batch file (Windows) for reading from a database and writing to the standard output:
    set QUERY="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM structures WHERE cd_id < 10000"
    ward -q %QUERY% -c 100 -f 512
    
  2. A UNIX shell script for reading from a database and writing to another table:
    QUERY="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM structures WHERE cd_id < 10000"
    INSERT="INSERT INTO clusters(cd_id, cluster_id) VALUES(?,?)"
    ward -q "$QUERY" -a "$INSERT" -c 100 -f 512
    

    Make sure that the clusters table exists and is empty before running the script.

  3. Clustering using the output of GenerateMD (in Unix)
    generatemd c -k CF -c cfp.xml -D <input.smi | ward -f 512 -c 100 -g
    
  4. Clustering using pharmacophore fingerprints (in Unix)
    generatemd c -k PF -c pharma-frag.xml -D <input.smi | ward -f 0 -m 210 -c 100 -g
    
  5. Testing different -c values using the output of a single RNN list generation. Singletons get negative cluster ids.
    generatemd c -k CF -c cfp.xml -D <input.smi >fingerprints.txt
    ward -f 512 -g <fingerprints.txt >neighborlists.txt
    ward -C -c 10 -y <neighborlists.txt >clusters.10.txt
    ward -C -c 50 -y <neighborlists.txt >clusters.50.txt
    ward -C -c 100 -y <neighborlists.txt >clusters.100.txt
    
  6. Using the Kelley method for the optimization of the number of clusters.
    generatemd c input.smi -k CF -c cfp.xml -D -o fingerprints.txt
    ward -f 512 -g -K kelley.txt <fingerprints.txt >neighborlists.txt
    
    An example of the generated text file (kelley.txt):
    Kelley Indexes for All Cluster Levels
    
    level	index
    1	500.000
    2	261.018
    ...
    18	32.038
    ...
    498	499.000
    499	500.000
    
    Optimal number of clusters: 18
    
    Clustering using the suggested number of clusters and the generated RNN list. Singletons get negative cluster ids.
    ward -C -c 18 -y <neighborlists.txt >clusters.18.txt
    
  7. Displaying the structures of the first cluster using the CreateView and MarvinView applications:
    • Clustering:
      generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
      ward -g -c 10 -f 512 < fingerprints.txt > clusters.txt
      
    • Creating an SDfile containing the structures from the first cluster (clid=1):
       
      crview -i id -c "clid=1" -s input.sdf -t clusters.txt >ward_result1.sdf
      
    • Displaying the structures and the NSC field (it comes from the original SDfile):
      mview -c 3 -r 3 -f NSC ward_result1.sdf
      
    A screenshot of MarvinView showing the cluster:

  8. Displaying the central objects of clusters that contain at least 20 compounds (size>=20) using the CreateView and MarvinView applications:
    • Clustering:
      generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
      ward -g -c 10 -f 512 -x -z < fingerprints.txt > clusters.txt
      
    • Creating an SDfile containing central objects of the clusters satisfying the condition:
      crview -i "centr:2" -c "size>=20" -d "clid:size" -s input.sdf -t clusters.txt >ward_result2.sdf
      
    • Displaying the structures, the NSC field (comes from the original SDfile), and the cluster size (only for the central compounds):
      mview -c 3 -r 3 -f "NSC:clid:size" ward_result2.sdf
      
    A screenshot of MarvinView showing the central objects:
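As a follow-up to Example 6, the suggested cluster count can be read from the Kelley statistics file instead of being copied by hand. This is a sketch that assumes the file contains a line of the form "Optimal number of clusters: 18", as in the sample shown in that example; the function name optimal_clusters is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the cluster count suggested in a Kelley
# statistics file. Assumes a line "Optimal number of clusters: <n>" as in
# the sample output of Example 6; returns non-zero if no such line exists.
optimal_clusters() {
    awk -F': *' '/^[[:space:]]*Optimal number of clusters:/ { print $2; found = 1; exit }
                 END { if (!found) exit 1 }' "$1"
}
```

With this helper, the last step of Example 6 could be written as ward -C -c "$(optimal_clusters kelley.txt)" -y <neighborlists.txt >clusters.txt.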

 

References

  1. Ward, J. H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Statist. Assoc. 1963, 58, 236-244.
  2. Murtagh, F. A Review of Fast Techniques for Nearest Neighbour Searching. In Havranek et al. (eds.), COMPSTAT 84, Physica-Verlag, Vienna, 143-147, 1984.
  3. El-Hamdouchi, A.; Willett, P. Hierarchic Document Clustering Using Ward's Method. In Proceedings of the Ninth International Conference on Research and Development in Information Retrieval, 149-156, 1986.
  4. Kelley, L. A.; Gardner, S. P.; Sutcliffe, M. J. An Automated Approach for Clustering an Ensemble of NMR-Derived Protein Structures into Conformationally-Related Subfamilies. Protein Eng. 1996, 9, 1063-1065.
 
Copyright © 1999-2008 ChemAxon Ltd.    All rights reserved.