chemaxon.clustering
Class LibraryMCS

java.lang.Object
  extended by chemaxon.clustering.LibraryMCS

public class LibraryMCS
extends java.lang.Object

The LibraryMCS class computes the maximum common substructure (MCS) of a set of compounds. It can suggest scaffolds of a library, in particular of VHTS hit sets. Such an input set typically contains a few thousand molecules, but LibraryMCS can cope with tens of thousands of molecules.
The algorithm does not determine a single subgraph common to all, or to the majority, of the input molecules; it determines the set of the most frequently occurring common substructures. The more diverse the analysed set is, the larger the number of frequent common substructures; a more focused set with limited structural diversity yields fewer frequent common substructures.
The algorithm can take this kind of scaffold analysis one or more levels further by finding the MCS of the frequent common substructures, and so on in a hierarchical manner.
Practically speaking, structures are clustered based on their MCSs (not on their similarities etc.) in a hierarchical clustering procedure.
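To make the idea of clustering by shared substructure concrete, the following toy sketch may help. It is not the LibraryMCS algorithm: plain strings stand in for molecules and the longest common substring stands in for the MCS. The class name ToyMcsGrouping, its methods and the greedy grouping rule are all assumptions made for illustration.

```java
// Toy analogue only: strings stand in for molecules and the longest common
// substring stands in for the MCS. This is NOT the LibraryMCS algorithm.
class ToyMcsGrouping {

    // Longest common substring of a and b, by classic dynamic programming.
    static String longestCommonSubstring(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        int endInA = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) {
                        best = len[i][j];
                        endInA = i;
                    }
                }
            }
        }
        return a.substring(endInA - best, endInA);
    }

    // Counts how many pairs of items share each common part of at least minSize characters.
    static java.util.Map<String, Integer> frequentCommonParts(java.util.List<String> items, int minSize) {
        java.util.Map<String, Integer> counts = new java.util.TreeMap<>();
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                String common = longestCommonSubstring(items.get(i), items.get(j));
                if (common.length() >= minSize) {
                    counts.merge(common, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Builds one level: each item joins the most frequent common part it contains;
    // items that contain none remain singletons keyed by themselves.
    static java.util.Map<String, java.util.List<String>> clusterOneLevel(java.util.List<String> items, int minSize) {
        java.util.Map<String, Integer> counts = frequentCommonParts(items, minSize);
        java.util.Map<String, java.util.List<String>> clusters = new java.util.TreeMap<>();
        for (String item : items) {
            String bestKey = item;
            int bestCount = 0;
            for (java.util.Map.Entry<String, Integer> e : counts.entrySet()) {
                if (item.contains(e.getKey()) && e.getValue() > bestCount) {
                    bestKey = e.getKey();
                    bestCount = e.getValue();
                }
            }
            clusters.computeIfAbsent(bestKey, k -> new java.util.ArrayList<>()).add(item);
        }
        return clusters;
    }
}
```

Calling clusterOneLevel on the items ringA, ringB, ringC, chainA and chainB with a minimum size of 3 groups them under the keys ring and chain. Calling it again on those two keys with a minimum size of 2 produces a second level keyed by their shared part, which mirrors the hierarchical step described above.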
This class defines the nested LibraryMCS.ClusterEnumerator class, which allows clients to retrieve the hierarchy. The tree of clusters, as well as the data associated with the nodes of this tree, can be accessed along with various code values that help reconstruct the hierarchy in custom applications.
This class also provides a simple command line interface for batch processing of MCS search for a set of structures, as well as a simple graphical user interface for easy navigation through clusters of structures.

Since:
JChem 3.2
Version:
0.7
Author:
Miklos Vargyas

Nested Class Summary
 class LibraryMCS.ClusterEnumerator
          The ClusterEnumerator is the right way to obtain results of a LibraryMCS clustering.
 
Field Summary
static int ATOM_COUNT_UPPER_BOUND
          structures above this size are not searched for pairwise MCS, as calculating the MCS would take too long
static int DEFAULT_ALLOWED_LEVEL_COUNT
          maximum number of levels in the hierarchy
static boolean DEFAULT_ATOM_TYPE_MATCH
          atom types are matched by default
static boolean DEFAULT_BOND_TYPE_MATCH
          bond types are matched by default
static boolean DEFAULT_CHARGE_MATCH
          atom formal charges are matched by default
static boolean DEFAULT_KEEP_RINGS
          ring bonds are not broken (rings are kept intact) by default
static int DEFAULT_MCS_MODE
          default MCS search mode
static int DEFAULT_MIN_MCS_SIZE
          default MCS size limit, the algorithm does not search for an MCS below this limit
static int DEFAULT_REQUIRED_CLUSTER_COUNT
          minimum number of top-level clusters
static int MAX_LEVEL_COUNT
          maximum allowed number of hierarchy levels
static int TERMINATION_CANCEL
          last search terminated due to user cancellation
static int TERMINATION_CLUSTER_COUNT
          last search terminated because the required top level cluster count was reached
static int TERMINATION_ERROR
          last search terminated due to an error, solution is not found
static int TERMINATION_LEVEL_COUNT
          last search terminated because the predefined allowed level count was reached
static int TERMINATION_MCS_SIZE_LIMIT
          last search terminated because the allowed minimum MCS size was reached
static int TERMINATION_SAME_PARAMETERS
          last attempt to cluster one level failed as the clustering parameters were the same as on the last level
static int TERMINATION_STEP_NOT_ALLOWED
          invalid call of method step()
static int TERMINATION_UNKNOWN
          last search terminated for an unknown reason, solution may not be found
 
Constructor Summary
LibraryMCS()
          Creates a new LibraryMCS instance.
 
Method Summary
 void addMolecule(Molecule mol)
          Adds a new molecule to the set of structures to be clustered.
 LibraryMCS.ClusterEnumerator getClusterEnumerator(boolean leavesOnly)
          Gets a new LibraryMCS.ClusterEnumerator object.
 LibraryMCS.ClusterEnumerator getClusterEnumerator(boolean leavesOnly, boolean selectedOnly)
          Gets a new LibraryMCS.ClusterEnumerator object.
 int getInputStructureCount()
          Retrieves the total number of input structures clustered.
 int getLevelCount()
          Retrieves the total number of levels in the hierarchy.
 int getStopCause()
          Internal code of last termination condition.
 java.lang.String getStopCauseExplanation()
          Detailed explanation of why the last search terminated.
 int getTopLevelClusterCount()
          Gets the number of clusters on the highest level of the hierarchy.
 int getTotalClusterCount()
          Gets the total number of clusters in the hierarchy.
static void main(java.lang.String[] args)
          Simple command line interface for batch processing.
 void reset()
          Resets the internal state to the initial values.
 boolean search()
          Performs hierarchical maximum common substructure search.
 void setAllowedLevelCount(int allowedLevelCount)
          Sets the maximum number of hierarchy levels allowed in clustering.
 void setAtomCountUpperBound(int atomCountUpperBound)
          Sets the maximum structure size for pairwise MCS search.
 void setAtomTypeMatch(boolean b)
          Sets the matching mode for atom types.
 void setBondTypeMatch(boolean b)
          Sets the matching mode for bond types.
 void setChargeMatch(boolean b)
          Sets the matching mode for atom formal charges.
 void setDissimCutoff(float dissimCutoff)
          Deprecated. This method has no effect from version 0.7 of LibraryMCS (JChem version 5.0.1) due to internal incompatibilities.
 void setFastSearch(boolean toggleFastSearch)
          Deprecated. Use setMCSMode(MCS.MODE_FAST) or setMCSMode(MCS.MODE_TURBO) instead; this method has no effect from version 5.0 of JChem.
 void setKeepRings(boolean b)
          Sets the matching mode for rings.
 void setMCSMode(int mode)
          Sets MCS search strategy.
 void setMCSSimilarityThreshold(float mcsSimilarityThreshold)
          Deprecated. The similarity threshold is not used from version 0.7 of LibraryMCS
 void setMinimalSimilarityMeasurement(float minSimilarity)
          Deprecated. Minimal similarity measurement is not used from version 0.7
 void setMinimumMCSSize(int mcsSize)
          Sets the minimum size of any MCS found.
 void setMode(int mode)
          Deprecated. Use setMCSMode instead.
 void setRequiredClusterCount(int requiredClusterCount)
          Sets the number of required clusters on the top level of the hierarchy.
 boolean step()
          Adds one more level to the existing cluster hierarchy.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_REQUIRED_CLUSTER_COUNT

public static final int DEFAULT_REQUIRED_CLUSTER_COUNT
minimum number of top-level clusters

See Also:
Constant Field Values

DEFAULT_ALLOWED_LEVEL_COUNT

public static final int DEFAULT_ALLOWED_LEVEL_COUNT
maximum number of levels in the hierarchy

See Also:
Constant Field Values

ATOM_COUNT_UPPER_BOUND

public static final int ATOM_COUNT_UPPER_BOUND
structures above this size are not searched for pairwise MCS, as calculating the MCS would take too long

See Also:
Constant Field Values

MAX_LEVEL_COUNT

public static final int MAX_LEVEL_COUNT
maximum allowed number of hierarchy levels

See Also:
Constant Field Values

DEFAULT_MCS_MODE

public static final int DEFAULT_MCS_MODE
default MCS search mode


DEFAULT_ATOM_TYPE_MATCH

public static final boolean DEFAULT_ATOM_TYPE_MATCH
atom types are matched by default

See Also:
Constant Field Values

DEFAULT_BOND_TYPE_MATCH

public static final boolean DEFAULT_BOND_TYPE_MATCH
bond types are matched by default

See Also:
Constant Field Values

DEFAULT_CHARGE_MATCH

public static final boolean DEFAULT_CHARGE_MATCH
atom formal charges are matched by default

See Also:
Constant Field Values

DEFAULT_KEEP_RINGS

public static final boolean DEFAULT_KEEP_RINGS
ring bonds are not broken (rings are kept intact) by default

See Also:
Constant Field Values

DEFAULT_MIN_MCS_SIZE

public static final int DEFAULT_MIN_MCS_SIZE
default MCS size limit, the algorithm does not search for an MCS below this limit

See Also:
Constant Field Values

TERMINATION_UNKNOWN

public static final int TERMINATION_UNKNOWN
last search terminated for an unknown reason, solution may not be found

See Also:
Constant Field Values

TERMINATION_ERROR

public static final int TERMINATION_ERROR
last search terminated due to an error, solution is not found

See Also:
Constant Field Values

TERMINATION_LEVEL_COUNT

public static final int TERMINATION_LEVEL_COUNT
last search terminated because the predefined allowed level count was reached

See Also:
Constant Field Values

TERMINATION_CLUSTER_COUNT

public static final int TERMINATION_CLUSTER_COUNT
last search terminated because the required top level cluster count was reached

See Also:
Constant Field Values

TERMINATION_MCS_SIZE_LIMIT

public static final int TERMINATION_MCS_SIZE_LIMIT
last search terminated because the allowed minimum MCS size was reached

See Also:
Constant Field Values

TERMINATION_CANCEL

public static final int TERMINATION_CANCEL
last search terminated due to user cancellation

See Also:
Constant Field Values

TERMINATION_SAME_PARAMETERS

public static final int TERMINATION_SAME_PARAMETERS
last attempt to cluster one level failed as the clustering parameters were the same as on the last level

See Also:
Constant Field Values

TERMINATION_STEP_NOT_ALLOWED

public static final int TERMINATION_STEP_NOT_ALLOWED
invalid call of method step()

See Also:
Constant Field Values

Constructor Detail

LibraryMCS

public LibraryMCS()
Creates a new LibraryMCS instance. It is an empty chemical space that is ready to take structures to be clustered.

Method Detail

reset

public void reset()
Resets the internal state to the initial values. Note that it does not clear the chemical space: input structures that were added (and clustered) previously are not removed, but clusters are deleted. This allows clustering to be rerun from scratch without importing and adding the input molecules again.
Typically, parameters are changed before reclustering.
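The reclustering pattern described here can be sketched as follows. So that the sketch runs without JChem, it uses a hypothetical Reclusterable interface that mirrors a few documented method names, and a fake implementation whose clustering rule is made up. Only the order of calls in recluster is the point, and setting the parameter after reset() is a cautious assumption.

```java
// Hypothetical interface: it mirrors a few documented LibraryMCS method names so the
// pattern can run without JChem. A thin adapter around LibraryMCS could implement it.
interface Reclusterable {
    void reset();
    void setMinimumMCSSize(int mcsSize);
    boolean search() throws InterruptedException;
    int getInputStructureCount();
    int getTotalClusterCount();
}

// Fake stand-in with a made-up clustering rule, present only so the demo has an effect.
class FakeReclusterable implements Reclusterable {
    private final java.util.List<String> inputs = new java.util.ArrayList<>();
    private int minimumSize = 1;
    private int clusterCount = 0;

    void addInput(String structure) { inputs.add(structure); }

    // As documented for reset(): clusters are deleted, input structures are kept.
    public void reset() { clusterCount = 0; }

    public void setMinimumMCSSize(int mcsSize) { minimumSize = mcsSize; }

    // Made-up rule: every input at least minimumSize characters long counts as a cluster.
    public boolean search() {
        clusterCount = 0;
        for (String s : inputs) {
            if (s.length() >= minimumSize) {
                clusterCount++;
            }
        }
        return clusterCount > 0;
    }

    public int getInputStructureCount() { return inputs.size(); }

    public int getTotalClusterCount() { return clusterCount; }
}

class ReclusterSketch {
    // Deletes the old clusters, applies a new parameter, and clusters the same inputs again.
    static boolean recluster(Reclusterable clusterer, int newMinimumMcsSize) {
        clusterer.reset();
        // Parameters are set after reset() because the text does not say whether
        // reset() also restores parameter defaults.
        clusterer.setMinimumMCSSize(newMinimumMcsSize);
        try {
            return clusterer.search();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
    }
}
```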


setDissimCutoff

public void setDissimCutoff(float dissimCutoff)
Deprecated. This method has no effect from version 0.7 of LibraryMCS (JChem version 5.0.1) due to internal incompatibilities.

Sets the dissimilarity cutoff.

Parameters:
dissimCutoff - dissimilarity threshold for MCS search

setMCSSimilarityThreshold

public void setMCSSimilarityThreshold(float mcsSimilarityThreshold)
Deprecated. The similarity threshold is not used from version 0.7 of LibraryMCS

Sets the minimum similarity threshold for the pairwise MCS calculations.

Parameters:
mcsSimilarityThreshold - minimum similarity threshold for the pairwise MCS calculations

setMinimalSimilarityMeasurement

public void setMinimalSimilarityMeasurement(float minSimilarity)
Deprecated. Minimal similarity measurement is not used from version 0.7

Sets the algorithm termination condition to test the minimal similarity of clusters. Clustering is carried on while there are compounds whose similarity is larger than the limit set by this method.

Parameters:
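minSimilarity - similarity limit; clustering continues while there are compounds whose similarity is larger than this value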

setRequiredClusterCount

public void setRequiredClusterCount(int requiredClusterCount)
Sets the number of required clusters on the top level of the hierarchy. Search terminates if the number of clusters on the highest level of the hierarchy is less than this limit.

Parameters:
requiredClusterCount - number of top level clusters

setAllowedLevelCount

public void setAllowedLevelCount(int allowedLevelCount)
Sets the maximum number of hierarchy levels allowed in clustering. Clustering terminates when the hierarchy has this many levels.

Parameters:
allowedLevelCount - number of hierarchy levels allowed (tree depth)

setAtomCountUpperBound

public void setAtomCountUpperBound(int atomCountUpperBound)
Sets the maximum structure size for pairwise MCS search. Structures above this size are not selected for a pairwise MCS search. This limit has a strong effect on the results as well as on the total running time. MCS search for larger structures (e.g. above 40 atoms) can be slow.

Parameters:
atomCountUpperBound - maximum atom count of structures selected for pairwise MCS search

setFastSearch

public void setFastSearch(boolean toggleFastSearch)
Deprecated. Use setMCSMode(MCS.MODE_FAST) or setMCSMode(MCS.MODE_TURBO) instead; this method has no effect from version 5.0 of JChem.

Toggles fast search mode. Fast search mode is provided by pairwise MCS (see chemaxon.sss.search.MCS); it is not guaranteed to give the optimal solution, though it works much faster. A good practice is to get a first insight into the set of structures with fast search, then apply the more rigorous but slower version if the quick evaluation looks promising.

Parameters:
toggleFastSearch - turn fast search on/off

setMCSMode

public void setMCSMode(int mode)
Sets MCS search strategy. Allowed values are MCS.MODE_EXACT, MCS.MODE_FAST and MCS.MODE_TURBO.

Parameters:
mode - mode flag

setMode

public void setMode(int mode)
Deprecated. Use setMCSMode instead.

Sets MCS search strategy.

Parameters:
mode - mode flag

setMinimumMCSSize

public void setMinimumMCSSize(int mcsSize)
Sets the minimum size of any MCS found. MCSs below this size limit are ignored.

Parameters:
mcsSize - minimum required size of any MCS

setAtomTypeMatch

public void setAtomTypeMatch(boolean b)
Sets the matching mode for atom types. Atom types can either be considered (checked) or ignored when two molecules are searched for an MCS.

Parameters:
b - flags if atom types are considered (true) or ignored (false)

setBondTypeMatch

public void setBondTypeMatch(boolean b)
Sets the matching mode for bond types. Bond types can either be considered (checked) or ignored when two molecules are searched for an MCS.

Parameters:
b - flags if bond types are considered (true) or ignored (false)

setChargeMatch

public void setChargeMatch(boolean b)
Sets the matching mode for atom formal charges. Charges can either be considered (checked) or ignored when two molecules are searched for an MCS.

Parameters:
b - flags if atom charges are considered (true) or ignored (false)

setKeepRings

public void setKeepRings(boolean b)
Sets the matching mode for rings. Ring bonds can either be broken, in which case ring bonds can match chain bonds (or ring bonds of a different ring size, e.g. a 6-membered ring bond against a 5-membered ring bond), or kept intact.

Parameters:
b - flags if rings are kept intact (true) or ring bonds may be broken (false)

addMolecule

public void addMolecule(Molecule mol)
Adds a new molecule to the set of structures to be clustered. The input molecule has to be standardized, which should include aromatization if bond types are considered in matching. Similarly, if the MCS search includes the matching of hybridization states, then they have to be computed by MoleculeGraph.calcHybridization() before calling this method.

Parameters:
mol - a molecular structure

search

public boolean search()
               throws java.lang.InterruptedException
Performs hierarchical maximum common substructure search. The search terminates when any of its termination conditions holds. When search() terminates, getStopCause() can be invoked to get the termination code (see the TERMINATION_* constants).

Returns:
true if a solution was found (i.e. at least one MCS was found and a cluster was successfully formed), false otherwise
Throws:
java.lang.InterruptedException

step

public boolean step()
Adds one more level to the existing cluster hierarchy. Method search() must be called prior to this method and it has to return true. Besides, clustering options (typically the allowed minimum size of the MCS, see setMinimumMCSSize(int)) must be changed, otherwise step() has no effect (since the termination conditions were already reached when the previous search() terminated).

Returns:
true if one more level was successfully added to the existing cluster hierarchy
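The search-then-step protocol described above can be sketched as a small driver loop. The SteppableClusterer interface and the FakeSteppable class are hypothetical stand-ins that mirror the documented method names so the loop runs without JChem. The fake only models the documented preconditions of step() plus an assumed level limit, and lowering the minimum MCS size by one per level is just one possible policy.

```java
// Hypothetical interface mirroring documented LibraryMCS method names.
interface SteppableClusterer {
    void setMinimumMCSSize(int mcsSize);
    boolean search() throws InterruptedException;
    boolean step();
    int getLevelCount();
}

// Fake stand-in: models only the preconditions of step() and an assumed level limit.
class FakeSteppable implements SteppableClusterer {
    private final int allowedLevels;
    private int minimumSize = 0;
    private int sizeUsedForLastLevel = -1;
    private int levels = 0;
    private boolean searched = false;

    FakeSteppable(int allowedLevels) { this.allowedLevels = allowedLevels; }

    public void setMinimumMCSSize(int mcsSize) { minimumSize = mcsSize; }

    public boolean search() {
        searched = true;
        levels = 1;
        sizeUsedForLastLevel = minimumSize;
        return true;
    }

    // Fails without a prior search(), with unchanged parameters, or at the level limit.
    public boolean step() {
        if (!searched || minimumSize == sizeUsedForLastLevel || levels >= allowedLevels) {
            return false;
        }
        levels++;
        sizeUsedForLastLevel = minimumSize;
        return true;
    }

    public int getLevelCount() { return levels; }
}

class StepLoopSketch {
    // Runs search(), then keeps lowering the minimum MCS size and calling step()
    // until step() fails or floorSize is reached. Returns the number of levels added.
    static int deepen(SteppableClusterer clusterer, int startSize, int floorSize) {
        clusterer.setMinimumMCSSize(startSize);
        try {
            if (!clusterer.search()) {
                return 0; // step() requires a successful search()
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return 0;
        }
        int added = 0;
        for (int size = startSize - 1; size >= floorSize; size--) {
            clusterer.setMinimumMCSSize(size); // change an option, otherwise step() has no effect
            if (!clusterer.step()) {
                break;
            }
            added++;
        }
        return added;
    }
}
```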

getClusterEnumerator

public LibraryMCS.ClusterEnumerator getClusterEnumerator(boolean leavesOnly)
Gets a new LibraryMCS.ClusterEnumerator object.

Parameters:
leavesOnly - if true, only leaf nodes are enumerated; otherwise all clusters are enumerated
Returns:
the initialized enumerator

getClusterEnumerator

public LibraryMCS.ClusterEnumerator getClusterEnumerator(boolean leavesOnly,
                                                         boolean selectedOnly)
Gets a new LibraryMCS.ClusterEnumerator object.

Parameters:
leavesOnly - if true, only leaf nodes are enumerated; otherwise all clusters are enumerated
selectedOnly - selected clusters and leaf nodes are listed
Returns:
the initialized enumerator
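The methods of LibraryMCS.ClusterEnumerator are not listed in this section, so the following sketch only illustrates what a leavesOnly filter means on a cluster tree. The ToyClusterTree class, its Node type and the depth-first order are assumptions for illustration, not the enumerator's actual behaviour.

```java
// Toy cluster tree: illustrates a leavesOnly filter, not the real ClusterEnumerator.
class ToyClusterTree {

    static final class Node {
        final String label;
        final java.util.List<Node> children = new java.util.ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            this.children.addAll(java.util.Arrays.asList(kids));
        }

        boolean isLeaf() { return children.isEmpty(); }
    }

    // Lists node labels depth-first; with leavesOnly set, inner cluster nodes are skipped.
    static java.util.List<String> enumerate(Node root, boolean leavesOnly) {
        java.util.List<String> out = new java.util.ArrayList<>();
        collect(root, leavesOnly, out);
        return out;
    }

    private static void collect(Node node, boolean leavesOnly, java.util.List<String> out) {
        if (!leavesOnly || node.isLeaf()) {
            out.add(node.label);
        }
        for (Node child : node.children) {
            collect(child, leavesOnly, out);
        }
    }
}
```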

getStopCause

public int getStopCause()
Internal code of last termination condition. This can be called after search() or step().

Returns:
code of last termination condition, see TERMINATION*
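A client may want to branch on the returned code. The sketch below shows one way to do that. The numeric values are placeholders invented for this example, since the real constant values are not given here; real code should compare against the LibraryMCS.TERMINATION_* constants. The class and method names are hypothetical.

```java
// Sketch only: the numeric values below are placeholders invented for this example.
// Real client code should reference the LibraryMCS.TERMINATION_* constants instead.
class StopCauseSketch {
    static final int TERMINATION_UNKNOWN = 0;
    static final int TERMINATION_ERROR = 1;
    static final int TERMINATION_LEVEL_COUNT = 2;
    static final int TERMINATION_CLUSTER_COUNT = 3;
    static final int TERMINATION_MCS_SIZE_LIMIT = 4;
    static final int TERMINATION_CANCEL = 5;
    static final int TERMINATION_SAME_PARAMETERS = 6;
    static final int TERMINATION_STEP_NOT_ALLOWED = 7;

    // Maps a stop cause to a short message that restates the documented meaning.
    static String describe(int stopCause) {
        switch (stopCause) {
            case TERMINATION_LEVEL_COUNT:
                return "allowed level count reached";
            case TERMINATION_CLUSTER_COUNT:
                return "required top-level cluster count reached";
            case TERMINATION_MCS_SIZE_LIMIT:
                return "minimum MCS size reached";
            case TERMINATION_SAME_PARAMETERS:
                return "parameters unchanged since the last level";
            case TERMINATION_STEP_NOT_ALLOWED:
                return "invalid call of step()";
            case TERMINATION_CANCEL:
                return "cancelled by the user";
            case TERMINATION_ERROR:
                return "error, no solution found";
            case TERMINATION_UNKNOWN:
                return "unknown reason, a solution may not have been found";
            default:
                return "unrecognized stop cause " + stopCause;
        }
    }

    // True for the two causes documented as possibly leaving no solution.
    static boolean solutionDoubtful(int stopCause) {
        return stopCause == TERMINATION_ERROR || stopCause == TERMINATION_UNKNOWN;
    }
}
```

For a human-readable message, the documented getStopCauseExplanation() method is the intended source; a switch like this one is only useful when the program itself has to react differently to each cause.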

getStopCauseExplanation

public java.lang.String getStopCauseExplanation()
Detailed explanation of why the last search terminated. This can be called after search() or step().

Returns:
text explaining why the search algorithm terminated

getInputStructureCount

public int getInputStructureCount()
Retrieves the total number of input structures clustered.

Returns:
number of clusters in the lowest level of the hierarchy

getLevelCount

public int getLevelCount()
Retrieves the total number of levels in the hierarchy.

Returns:
number of hierarchy levels

getTotalClusterCount

public int getTotalClusterCount()
Gets the total number of clusters in the hierarchy. Leaf nodes storing the input structures are not considered, only higher level nodes that represent real clusters. Singletons are included.

Returns:
number of clusters

getTopLevelClusterCount

public int getTopLevelClusterCount()
Gets the number of clusters on the highest level of the hierarchy. Singletons are included.

Returns:
number of clusters on the top hierarchy level

main

public static void main(java.lang.String[] args)
Simple command line interface for batch processing. Run this class with the -h flag on its command line to get a brief summary of the command line syntax and the available options.

Parameters:
args - command line arguments