OptimizeMetrics is a tool for preparing the screening of a compound database against a class of active molecules. It is run before the screening to make it more effective, that is, to reduce the number of hits from the database while ensuring that 'good' hits (hits that are chemically similar to the actives, have a similar pharmacophore, etc.) are retained. OptimizeMetrics is a preparation tool for the screening application ScreenMD: it sets up the dissimilarity metrics that ScreenMD uses to compare molecular descriptors generated from the active and target compounds. The aim is to enhance the selectivity of these metrics by optimizing their parameters, so that a minimal number of hits is selected from the target database while the similarity between the hits and the actives stays maximal.
OptimizeMetrics can optimize the parameters of different metrics for several molecular descriptor types. It supports the same descriptor types as ScreenMD, which at present are two-dimensional pharmacophore fingerprints and chemical fingerprints. The available dissimilarity metrics of these two descriptor types and their parameters are listed below:
OptimizeMetrics can generate values for the following parameters of each parametrized metric listed above: weights, asymmetry factor and scale factor. The Euclidean metric of pharmacophore fingerprints can also be normalized, but normalization is independent of the optimization. All parameters are optimized to improve the performance of the metrics when large databases of molecules are screened against a set of actives, that is, to give better and fewer hits. Finally, a reasonable threshold is set for each dissimilarity metric. It can be used as a cut-off value in the subsequent screening to select hits from the target set: molecules whose dissimilarity from the actives is under the threshold are selected as hits.
As mentioned earlier, this tool was developed to be used before setting up the screening of a larger database with ScreenMD. After the optimization it is also beneficial to perform a test screen on a smaller subset of molecules to obtain statistical information about the effectiveness of screening with the optimized metrics. This can be done with HitStatistics. These three applications are closely related, and this document assumes knowledge of the definitions and functions described in the ScreenMD and HitStatistics documentation.
Summarizing, these tools are typically used in the following order:
A separate tool, ScreeningOptimizer, unifies the functions of OptimizeMetrics and HitStatistics with simplified usage. It also prepares the random molecule sets required by these applications from the target set of molecules and the actives.
OptimizeMetrics generates values for the parameters of dissimilarity metrics by training them on three sets of molecules. The first set is the so-called target set: a set of molecules, as diverse as possible, selected from a larger database, for example a random selection from the final target of the screen. This set may contain a few hundred molecules. A small application, RandomMS, is available for obtaining a random selection of target molecules from a larger set [1]. The second set is the test set, which contains molecules that are known to be similar to the actives. The third set is the query set of actives: the molecules against which comparisons are performed, either by using them to create a hypothesis or by comparing against each of them one by one. A reasonable way to obtain the last two sets is to cut the original set of actives into two and use the resulting sets as the test set and the query set. The section Optimization method describes how these three sets are used to optimize the parameters of dissimilarity metrics.
In addition, the molecular descriptors and the parametrized metrics whose parameters are to be optimized must be specified. The section Usage describes how to do this.
The parameters of one dissimilarity metric are optimized simultaneously, while the dissimilarity metrics themselves are processed one by one. The time needed to optimize a metric depends on the number of parameters to optimize. For example, optimizing the asymmetry factor of an asymmetric Tanimoto metric (a single parameter) takes much less time than optimizing a weighted, asymmetric Euclidean metric (which may have over two hundred parameters).
All generated weights are in the range 0 to 1. Scale factors take values between 1 and 8, and asymmetry factors are between 0 and 0.5. Each parameter belongs to exactly one dissimilarity metric; for example, an asymmetric Tanimoto metric has its own asymmetry factor, separate from that of an asymmetric Euclidean metric.
The specified parametrized metrics, along with the calculated values of their parameters, are added to the existing configuration of each molecular descriptor and are written into the specified output files. All the optimized parametrized metrics can be used for screening in a subsequent run of HitStatistics or ScreenMD if these output files are used there for configuration. Thresholds for the dissimilarity metrics are also added (these are usually necessary in a large screen), so the parametrized metrics are ready for use in a larger screening.
The importance of optimizing the parameters of dissimilarity metrics can be demonstrated with comparisons to the performance of basic, non-parametrized metrics (see examples).
The optimization method performs numerous screens on the training sets of molecules described in the introduction. In each step of the optimization, with each setting of the parameters, it screens both the target set and the test set against the query set. Its goal is to find the parameter values for which most of the molecules from the test set are selected as hits (they should be hits, because they are known to exhibit similar behavior), while as few molecules as possible are selected from the target set (so that only the most similar ones are selected). Because this screening is performed many times, the size of the training sets should be limited if results are to be obtained relatively fast: in our tests the target set contained several hundred molecules, and the test set and the query set each contained fewer than a hundred compounds.
At present, two goal functions (also called evaluator functions in our terminology) are available to measure the effectiveness of a parameter setting. One is the widely used enrichment ratio; the other, called selectivity effectiveness, is an in-house development. Both functions are maximized, i.e. a parameter setting is considered better than another if it yields a greater goal function value. The formulas of these goal functions are given below.
E = Ha (A + T) / (A (Ha + Ht))
where E is the enrichment ratio, A is the total number of molecules in the test set, T is the total number of target molecules, Ha is the number of hits from the test set, and Ht is the number of target hits. The enrichment ratio measures how much the method improves on random selection: it tells how many times more likely a single pick from the hit list is to be an active (a member of the test set) than a single random pick from the union of the test set and the target set (the latter treated as the set of inactives).
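The enrichment ratio can be checked by hand with a few lines of Python. The sketch below is only an illustration of the formula, not code from OptimizeMetrics; the function name and the convention of returning 0 for an empty hit list are assumptions.

```python
def enrichment_ratio(test_hits, target_hits, test_size, target_size):
    """E = Ha (A + T) / (A (Ha + Ht)), written as a ratio of two hit rates."""
    total_hits = test_hits + target_hits
    if total_hits == 0:
        # Assumption: an empty hit list is treated as "no enrichment".
        return 0.0
    active_rate_among_hits = test_hits / total_hits
    active_rate_at_random = test_size / (test_size + target_size)
    return active_rate_among_hits / active_rate_at_random


# Hypothetical counts: A = 50, T = 500, Ha = 40, Ht = 10.
print(enrichment_ratio(40, 10, 50, 500))
```

With these hypothetical counts the ratio is about 8.8: 80% of the selected molecules are actives, against roughly 9% in the combined set. Selecting every molecule gives a ratio of 1, the same as random selection.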
SE = w Ha / A + (1 - w) (T - Ht) / T
where SE is the selectivity effectiveness and w is a weight between 0 and 1 (the asymmetry factor of the function) that sets the relative importance of the two terms. The first term describes selectivity on the test set, the second describes contra-selectivity on the target set. The advantage of this function is that its value always lies between 0 and 1 and that it measures selectivity separately for the two sets. Its disadvantage is that it may accept too many hits from the target set: the target set is usually much bigger than the test set, so the demand to find many hits from the test set may weigh too heavily and result in a huge number of hits from the target set. This problem can be treated by choosing an appropriate weight for the two terms.
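The effect of the weight w is easiest to see on concrete numbers. The following sketch simply evaluates the SE formula; the function name and the example counts are hypothetical, and it is not taken from the OptimizeMetrics sources.

```python
def selectivity_effectiveness(test_hits, target_hits, test_size, target_size, w=0.5):
    """SE = w Ha / A + (1 - w) (T - Ht) / T, with w between 0 and 1."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1")
    selectivity = test_hits / test_size                               # share of test set found
    contra_selectivity = (target_size - target_hits) / target_size    # share of target set rejected
    return w * selectivity + (1.0 - w) * contra_selectivity


# Hypothetical counts: A = 50, T = 500, Ha = 40, Ht = 10.
print(selectivity_effectiveness(40, 10, 50, 500, w=0.5))
print(selectivity_effectiveness(40, 10, 50, 500, w=0.3))
```

A setting that accepts every molecule scores exactly w, and one that rejects every molecule scores 1 - w, so lowering w penalizes target hits more strongly.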
During the search, the values of the parameters are selected systematically from their ranges. A screen is performed with each parameter setting, and from all the screens the maximal goal function value and the corresponding parameter values are selected. At first each parameter is set to the mid-point of its range (e.g. 0.5 for weights). Then the first parameter is varied on an equidistant scale between the end-points of its range while the rest are kept fixed, and the value for which the goal function is maximal is selected. If the maximal goal function value is reached at several parameter values, the largest of them is always selected for further processing. The same variation is then done with the second, third, etc. parameters. One iteration step is finished when each parameter has been varied over its range. Iterations are repeated until the changes in the parameter values and in the maximal goal function value become negligible. When this stage is reached, the distance between the evaluation points of the variable ranges is halved (i.e. for each variable the number of points at which the goal function is evaluated is roughly doubled), and the iteration process described above is repeated. The algorithm terminates when the difference between the optimal values found with the refined subdivision and those found with the previous, coarser subdivision becomes negligible. This stepwise refinement is reminiscent of the techniques used in simulated annealing algorithms.
During the optimization it is always ensured that at least one weight differs from zero. No such constraint links the other parameters: the value of each can be chosen independently of the rest.
After the optimization an acceptance threshold is selected for each metric. It depends on the user's requirement for the minimal number of hits from the test set. The threshold is selected as follows: the test set is screened with the optimized parametrized metric, and the threshold is set to the dissimilarity value for which the required number of test set dissimilarity values are lower or equal.
The method described above works for all parametrized metrics, irrespective of their properties, e.g. their number of variables. It has a high probability of enhancing the performance of a metric significantly, although it is of course not guaranteed to find the best possible parameter values, since it does not perform an analytical optimization. (An analytical optimization is theoretically impossible when the goal function is the enrichment ratio or the selectivity effectiveness, as these are not continuous functions.)
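The search described above reads as a coordinate-wise grid search with successive grid refinement. The Python sketch below follows that reading under stated assumptions: `goal` is a hypothetical callable standing in for a full screen plus evaluator function, and the tolerance, the iteration limits and the initial number of grid intervals are illustrative choices, not values used by OptimizeMetrics. The constraint on weights (at least one non-zero) is left out.

```python
def sweep(goal, params, ranges, intervals):
    """Vary each parameter in turn on an equidistant grid, keeping the rest fixed."""
    params = list(params)
    score = goal(params)
    for i, (lo, hi) in enumerate(ranges):
        best_value, best_score = params[i], None
        for k in range(intervals + 1):
            value = lo + (hi - lo) * k / intervals
            trial_score = goal(params[:i] + [value] + params[i + 1:])
            # ">=" keeps the largest parameter value when several reach the maximum
            if best_score is None or trial_score >= best_score:
                best_value, best_score = value, trial_score
        params[i] = best_value
        score = best_score
    return params, score


def optimize(goal, ranges, intervals=4, tol=1e-6, max_sweeps=50, max_refinements=5):
    """Start at the mid-points, sweep until settled, then halve the grid spacing."""
    params = [(lo + hi) / 2 for lo, hi in ranges]
    score = goal(params)
    previous = None  # result obtained on the previous, coarser grid
    for _ in range(max_refinements):
        for _ in range(max_sweeps):
            new_params, new_score = sweep(goal, params, ranges, intervals)
            settled = abs(new_score - score) <= tol and all(
                abs(a - b) <= tol for a, b in zip(new_params, params))
            params, score = new_params, new_score
            if settled:
                break
        if previous is not None and abs(score - previous[1]) <= tol and all(
                abs(a - b) <= tol for a, b in zip(params, previous[0])):
            break
        previous = (params, score)
        intervals *= 2  # halving the spacing doubles the number of intervals
    return params, score


def toy_goal(p):
    # Hypothetical smooth goal with its maximum at (0.25, 0.75).
    return -(p[0] - 0.25) ** 2 - (p[1] - 0.75) ** 2


best_params, best_score = optimize(toy_goal, [(0.0, 1.0), (0.0, 1.0)])
print(best_params, best_score)
```

Because only one parameter changes at a time, the cost of one sweep grows linearly with the number of parameters, which is consistent with the remark that metrics with many weights take much longer to optimize. Like the method it illustrates, the sketch can stop at a local maximum.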
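The threshold selection can be pictured as picking an order statistic of the test set dissimilarities. The sketch below assumes that each test molecule already has a single dissimilarity value (for example against a hypothesis) and that the required hit count is rounded up; both the function name and the rounding rule are assumptions rather than documented behavior.

```python
import math


def acceptance_threshold(test_dissimilarities, percentage=80.0):
    """Smallest cut-off at which the required share of the test set are hits."""
    if not test_dissimilarities:
        raise ValueError("the test set must not be empty")
    size = len(test_dissimilarities)
    # Assumption: the required number of hits is rounded up.
    required = math.ceil(size * percentage / 100.0)
    required = max(1, min(required, size))
    return sorted(test_dissimilarities)[required - 1]


# Hypothetical dissimilarity values of ten test set molecules.
values = [0.9, 0.1, 0.4, 0.3, 0.2, 0.8, 0.5, 0.7, 0.6, 1.0]
threshold = acceptance_threshold(values, percentage=80.0)
hits = [v for v in values if v <= threshold]
print(threshold, len(hits))
```

Here eight of the ten values are lower than or equal to the selected threshold, which meets the 80% requirement.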
The output of optimization is one value for each parameter of each parametrized metric.
optimizemetrics <target file> <test file> <query file> [<options>]
optimizemetrics config <configuration file> [<general options>]
These two modes are not strictly exclusive; they can be mixed in various ways. Command line parameters can extend the settings provided in the configuration file. File names can be specified in the command line even when parameters are defined in the configuration file; in this case the files given in the command line are processed. However, this kind of usage is recommended only for expert users. The exact specification of the command line syntax is thus as follows:
optimizemetrics config <configuration file> <target file> \
<test file> <query file> [<options>]
Note that, when specified, the configuration file must be the first argument
after the optimizemetrics command in the command line. Similarly,
filenames are positional: if input is taken from files, the filenames must
directly follow either the command name or the name of the configuration file.
Also note that the order of the filenames is fixed: first the target file,
then the test file, then the query file.
Prepare the usage of the optimizemetrics script or batch file
as described in Preparing the Usage of JChem
Batch Files and Shell Scripts.
Alternatively, the OptimizeMetrics class can be directly
invoked:
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%"
chemaxon.descriptors.OptimizeMetrics
<target file> <test file> <query file> [<options>]
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%"
chemaxon.descriptors.OptimizeMetrics
config <configuration file> [<general options>]
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.descriptors.OptimizeMetrics <target file> \
<test file> <query file> [<options>]
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.descriptors.OptimizeMetrics \
config <configuration file> [<general options>]
Options and parameters can either be defined in the command line or be specified in an XML configuration file. The command line mode is more suitable for smaller experiments, whereas configuring OptimizeMetrics from XML is convenient even for much larger virtual screening exercises. Although an example configuration file is available, users are not encouraged to write such configuration files manually; using the XML configuration editor instead is highly recommended.
General options:
-h, --help this help message
-x, --expert-help advanced options for expert users
-v, --verbose verbose
Output options:
-g, --generate-id [<first>]
generate unique structure identifiers
an optional value for the first ID can be given
-e, --precision <prec> number of decimal places after the decimal point
Descriptor options:
-k, --descriptor <type> <common descriptor options> <type specific flags>
use descriptors of the given type
according to the type specific flags (see below)
Supported types are: CF, PF
Common descriptor options:
-c, --config <configfile>
path and name of the XML configuration file
-o, --output <filepath> output xml file path and name
-t, --use-tag [<name>] use existing descriptor data
-M, --metric <name> <type-name> <metric specific flags>
configure a parametrized metric with name <name>
as specified in the metric specific flags.
Supported type names are: Tanimoto, Euclidean,
FBPA. This option may appear more than once.
Descriptor type specific flags:
Metric specific flags:
-w, --weights metric is weighted, generate optimized weights.
-W, --cell-weights weights for Euclidean metric of pharmacophore
fingerprints is optimized separately for each
fingerprint cell. Valid only for descriptor
PharmacophoreFingerprint and metric Euclidean.
-s, --scale-factor metric is scaled, generate optimized scale factor.
-a, --asymmetry-factor metric is asymmetric, generate optimized
asymmetry factor.
-n, --normalize normalize metric to have dissimilarity values
as floats between 0 and 1.
Similarity options:
-Q, --compare-queries compare against query descriptor sets
-H, --compare-hypothesis [<name> [C]]
compare against hypothesis <name>
Valid names are: Minimum, Average, Median.
Default hypothesis type is Minimum.
'C' indicates consensus fingerprint.
This flag may occur more than once with different
hypothesis types.
Parameter optimization options:
-p, --percentage <value> minimal percentage of similar molecules
required to be hits. Default value is 80.
-f, --evaluator-function <name> <asymmetry-factor>
use the specified evaluator function as the goal
function of optimization. Allowed types are:
Enrichment, SelectivityEffectiveness,
ActiveHitDistribution.
Asymmetry-factor should be specified only for
function SelectivityEffectiveness. Its default
value is 0.5 (no asymmetry).
Default function is Enrichment. Only one goal
can be specified at a time.
Advanced options for expert users:
SDfile options:
-I, --id-tag <name> name of the tag storing unique molecule identifiers
2D pharmacophore fingerprint options:
-P, --PMAP-tag [<name>] use existing PMAP data
Similarity options
-Z, --zero-threshold percentage threshold for zero limit in median
hypothesis
Output options:
-l, --split-output split output configuration file by metrics
Many of the options are the same as in the applications
ScreenMD and HitStatistics.
For a detailed description of the generic flags, the molecular descriptor related flags,
the similarity flags and some of the input/output flags, see
ScreenMD options.
Specific options of this application are the output option (-o), the
metric option (-M) with its metric specific flags, and all the parameter
optimization options.
The output option defines the path and name of the XML file for each molecular descriptor, where the original configuration of the descriptor is written along with all the parameters of all specified dissimilarity metrics. These files are the main output of the program, as they pass the result of the optimization to the subsequent applications (statistical calculations, screening). If these files are used there for configuration, then all the optimized parametrized metrics can be used for screening.
If the advanced option --split-output is set, then the generated
parametrized metrics are also stored in separate screening configuration
files (also in XML format). In this case file names are generated automatically
from the descriptor name and the metric name. These configuration files are
needed if a subsequent call to GenerateMD
is to store the parametrized metrics set by the optimization in a
database. This is required when the application using these
metrics is the JSP example developed by our company,
rather than only the command-line interfaces HitStatistics
or ScreenMD.
The metric flag defines the parametrized metrics which are to be optimized. It can be used several times for each molecular descriptor. For each parametrized metric the following settings are required: its name (defined by the user, e.g. WeightedEuclidean), the original dissimilarity metric (e.g. Euclidean), and the parameters to be added (e.g. weights). An example:
-M ST Tanimoto -s -M EWAN Euclidean -w -a -n
defines two parametrized metrics, called ST and EWAN: a scaled Tanimoto (scale factor is to be optimized) and a weighted asymmetric normalized Euclidean (weights and asymmetry factor are to be optimized). If no parameters are specified, then the given basic metric is added along with a threshold to the configuration.
For the Euclidean metric of pharmacophore fingerprints two weighting modes are available. If the -w flag is defined, then a faster optimization is performed: the number of weight variables is significantly reduced. In this case weights corresponding to each cell of the fingerprints are generated from a reduced number of weight variables. One variable is assigned to each pharmacophore feature and to each topological distance. Weights used in the calculation of the Euclidean distance between two pharmacophore fingerprints (a separate weight for each fingerprint cell) are generated by multiplying the corresponding feature weights (for both features in the feature pair) and the weight of the corresponding topological distance. This way the actual number of optimized variables is reduced significantly, therefore the optimization of the weighted Euclidean metric requires tangibly less time. If the -W flag is selected then weight variables are generated and optimized for each fingerprint cell separately. The output in both cases is the same: separate weights for each cell, only the internal optimization process is different.
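The reduced weighting mode (-w) can be illustrated with a small Python sketch that expands per-feature and per-distance variables into one weight per fingerprint cell. The feature names, the distance range and the way cells are keyed are hypothetical; the actual cell layout is defined by the pharmacophore fingerprint configuration.

```python
from itertools import combinations_with_replacement


def cell_weights(feature_weights, distance_weights):
    """Return {(feature_a, feature_b, distance): weight} for every fingerprint cell."""
    weights = {}
    for a, b in combinations_with_replacement(sorted(feature_weights), 2):
        for distance, distance_weight in distance_weights.items():
            # cell weight = weight of both features in the pair * weight of the distance
            weights[(a, b, distance)] = (
                feature_weights[a] * feature_weights[b] * distance_weight)
    return weights


# Hypothetical optimized variables: 3 feature weights and 2 distance weights.
features = {"donor": 0.5, "acceptor": 1.0, "hydrophobic": 0.2}
distances = {1: 1.0, 2: 0.5}
cells = cell_weights(features, distances)
print(len(cells), cells[("acceptor", "donor", 2)])
```

In this toy setup five optimized variables produce twelve cell weights; with realistic numbers of features and distances the reduction is far larger, which is why the -w mode is faster than optimizing every cell separately with -W.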
There are two optimization specific flags: the percentage and the selection of the evaluator function. The percentage flag determines the minimal percentage of molecules from the test set that the optimizer is required to find as hits. This is a constraint for the optimization: the maximal value of the evaluator function is sought only among the parameter settings for which at least the required percentage of test set molecules (relative to the total number of molecules in the test set) are selected as hits. The evaluator function flag allows the user to select the goal function to be maximized. At present, two functions are available, both described earlier: the enrichment ratio and the selectivity effectiveness.
The same parameters can be defined in the XML configuration file. These options
appear in the configuration editor labeled with the
long forms of the command line parameter names given above, with only small differences. In
most cases the hyphens are dropped and the first letter of each word is capitalized.
For example, --compare-queries is displayed as
CompareQueries. In other cases, especially when the option has
parameters, a frame has to be filled in instead of a single edit field. For example,
--compare-hypothesis is replaced by a frame where all
the hypotheses can be specified with their type and a consensus can be selected.
The use of the configuration editor is straightforward.
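The role of the percentage constraint can be sketched as a filter applied before the maximization. The example below is an assumption-laden illustration: the candidate settings, their hit counts and the helper names are invented, and the enrichment ratio is used as the goal.

```python
def enrichment(candidate, test_size, target_size):
    """Enrichment ratio of one candidate setting (hit counts stored in a dict)."""
    ha, ht = candidate["test_hits"], candidate["target_hits"]
    if ha + ht == 0:
        return 0.0
    return ha * (test_size + target_size) / (test_size * (ha + ht))


def best_feasible(candidates, test_size, target_size, percentage=80.0):
    """Maximize the goal only over settings that find enough test set hits."""
    feasible = [c for c in candidates
                if 100.0 * c["test_hits"] / test_size >= percentage]
    if not feasible:
        return None
    return max(feasible, key=lambda c: enrichment(c, test_size, target_size))


# Hypothetical screening results of three parameter settings (A = 50, T = 500).
candidates = [
    {"name": "setting-1", "test_hits": 45, "target_hits": 60},
    {"name": "setting-2", "test_hits": 40, "target_hits": 10},
    {"name": "setting-3", "test_hits": 30, "target_hits": 2},
]
chosen = best_feasible(candidates, test_size=50, target_size=500, percentage=80.0)
print(chosen["name"])
```

The third setting has the highest enrichment ratio but finds only 60% of the test set, so it is excluded and the second setting is chosen.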
Merging short forms of command line options is not supported (that is,
instead of -vQ the form -v -Q should be used).
Warning! A valid license key is needed to use OptimizeMetrics. When no valid license key is found in the home directory, OptimizeMetrics runs in demo mode, where the number of molecular descriptors to be processed is limited to 2000 (thus, if several types of molecular descriptors are generated, the number of structures may be limited to a few hundred).
Since OptimizeMetrics does not process a large amount of data, database connection is not supported yet.
OptimizeMetrics accepts molecular structure files as input. Three input files must be supplied: a file containing the selected target set of molecules, a file containing the test set compounds known to be similar to the queries (e.g. a subset of the actives), and a file containing the queries (a subset of the actives, complementary to the test set).
Most molecular file formats are accepted (MDL molfile, Compressed molfile, SDfile, Compressed SDfile, SMILES, etc.).
If the input file is an SDfile, it may already contain descriptors of
molecules. This information can either be used or ignored. The default
behavior of OptimizeMetrics is to ignore such information, in which case
descriptors are generated from the original molecular structures. This can be
overridden with the --use-tag flag, then descriptors stored in the
SDfile tags are used. The default SDfile tags for storing molecular descriptor
and related data are:
Tag names other than the defaults can be specified with the --tag-name
option.
SDfiles containing descriptors can be generated with
GenerateMD. Existing descriptors are
worth being reused as doing so can reduce running times.
OptimizeMetrics writes the resulting parameter values into XML files, separately
for each molecular descriptor. The names of the files should be
specified with the --output flag in each molecular descriptor
section. If the name is not specified, the results are written to standard
output. The content of an output XML file is the same as that of
the configuration file specified by the --config flag, completed
with the required parameter values in a form that can be used directly in a
subsequent screening or statistical calculation.
If required, the generated parametrized metrics are also stored in separate screening XML configuration files. In this case file names are generated automatically from the descriptor name and the metric name. These configuration files are needed if a subsequent call to GenerateMD is to store the parametrized metrics set by the optimization in a database. This is required when the application using these metrics is the JSP example developed by our company, rather than only the command-line interfaces HitStatistics or ScreenMD.
Besides the XML configuration file that can optionally be used to specify parameter settings (see Usage), OptimizeMetrics also takes mandatory configuration files. These files correspond to the molecular descriptors used for screening; there should be one file per descriptor.
Molecular descriptor settings are also defined in external text (XML) files; these settings are described in PMapper configuration and ScreenMD configuration. The configuration files can be edited with the Configuration Editor GUI, which simplifies the setup of the required parameters. Sample configuration files are available in the 'examples/config' directory (see pharmacophore fingerprint configuration, chemical fingerprint configuration and OptimizeMetrics configuration).
cd examples
bin/optimizemetrics config config/optimizemetrics.xml -v
This example is available as an executable example in the
'examples/bin' directory (see optimizemetrics_example).
It generates a few parametrized metrics for chemical fingerprints, and many for
basic and fuzzy pharmacophore fingerprints. The input is 100 target molecules from
the NCI database. Some β2 antagonists of the adrenergic receptor
are used for the minimum hypothesis against which comparisons are performed; others
are included in the test set.
cd test/pharmacophore
optimizemetrics nci100.smiles \
    beta2-agonists-test.sdf beta2-agonists-queries.sdf \
    -k PF -c pharma-frag.xml -o beta2-agonists-pharma.xml \
    -M WeightedEuclidean Euclidean -w
Since output precision is not specified, the default precision is used. Other default settings applied: target and test set molecules are compared against each query, the minimal required percentage of test set hits is 80%, and the enrichment ratio is used as evaluator function.
optimizemetrics nci100.smiles \
    beta2-agonists-test.sdf beta2-agonists-queries.sdf \
    -k PF -c pharma-frag.xml -o beta2-agonists-pharma.xml \
    -M Eucl Euclidean \
    -M WEucl Euclidean -w \
    -M AEucl Euclidean -a \
    -M WAEucl Euclidean -w -a \
    -M WANEucl Euclidean -w -a -n \
    -M Tan Tanimoto \
    -M STan Tanimoto -s \
    -M ATan Tanimoto -a \
    -M ASTan Tanimoto -a -s
Nine parametrized metrics are defined: a basic Euclidean with only a threshold, set to match the requirement for the minimal number of test set hits; a weighted Euclidean; an asymmetric Euclidean; a weighted and asymmetric Euclidean; a weighted, asymmetric and normalized Euclidean; a basic Tanimoto with only a threshold value; a scaled Tanimoto; an asymmetric Tanimoto; and an asymmetric scaled Tanimoto. Thresholds are generated in each case. Asymmetry factors, weights and scale factors are optimized separately for each parametrized metric.
randomms nci10000.smiles nci100.smiles nci-rest.smiles -n 100 -v
optimizemetrics nci100.smiles \
    beta2-agonists-test.sdf beta2-agonists-queries.sdf \
    -k PF -c pharma-frag.xml -o beta2-agonists-pharma.xml \
    -M WANEucl Euclidean -w -a -n \
    -M ASTan Tanimoto -a -s \
    -k CF -c cfp.xml -o beta2-agonists-CF.xml \
    -M AEucl Euclidean -a \
    -M ASTan Tanimoto -a -s
The file nci100.smiles is generated by RandomMS, by random selection of 100 molecules from nci10000.smiles. The remaining 9900 molecules are written into nci-rest.smiles. 2D pharmacophore fingerprints and hashed chemical fingerprints are generated for each molecule, and the specified parametrized metrics are optimized and added to the corresponding output XML files.
randomms nci10000.smiles nci100.smiles nci-rest.smiles -n 100 -v
optimizemetrics nci100.smiles \
    beta2-agonists-test.sdf beta2-agonists-queries.sdf \
    -H -v -e 3 -p 70 -f SelectivityEffectiveness 0.3 \
    -k PF -c pharma-frag.xml -o beta2-agonists-pharma.xml \
    -M WANEucl Euclidean -w -a -n \
    -M ASTan Tanimoto -a -s \
    -k CF -c cfp.xml -o beta2-agonists-CF.xml \
    -M AEucl Euclidean -a \
    -M ASTan Tanimoto -a -s
Output precision after the decimal point is set to 3, required percentage of test set hits is set to 70%, the evaluator function (goal) is selected to be 'selectivity effectiveness' (with asymmetry factor 0.3), and verbose mode is used.
randomms nci10000.smiles nci100.smiles nci-rest.smiles -n 100 -v
optimizemetrics nci100.smiles \
    beta2-agonists-test.sdf beta2-agonists-queries.sdf \
    -H -v -e 3 -p 70 -f SelectivityEffectiveness 0.3 \
    -k PF -c pharma-frag.xml -o beta2-agonists-pharma.xml \
    -M WANEucl Euclidean -w -a -n \
    -M ASTan Tanimoto -a -s \
    -k CF -c cfp.xml -o beta2-agonists-CF.xml \
    -M AEucl Euclidean -a \
    -M ASTan Tanimoto -a -s \
    -k PF -c pharma-frag.xml -o beta2-agonists-pharma-fuzzy.xml -z 0.4 \
    -M WEucl Euclidean -w \
    -M Tan Tanimoto
Further examples of the usage of HitStatistics demonstrate the importance of parameter optimization.
[1] The application is RandomMS. It can be found in the 'examples/bin' directory.