ScreeningOptimizer

Version 5.0.6


Introduction

ScreeningOptimizer combines the functionality of the applications OptimizeMetrics (optimizer) and HitStatistics (statistics calculator). This document assumes that the reader is familiar with these tools. ScreeningOptimizer was developed to provide an easy-to-use tool that prepares everything needed for setting up the final screening of a large database. It simplifies the use of the two pre-screening tools by setting up their input automatically. Consequently, it also gives the user less control.

ScreeningOptimizer automatically generates the molecule set files required by OptimizeMetrics and HitStatistics from two molecule sets: the target set of molecules to be screened and the set of actives against which the screening is performed. It generates the query files, test files and target files for the optimizer and the statistics calculator by random selection from the two given molecular files. Target sets are generated from the original target set; query sets and test sets are generated from the actives. The target set is divided into two parts: a small part (e.g. a few hundred molecules) goes into the target set of the optimization, the rest into the target set of the statistics calculator. The user can specify how many molecules go into the target set of the optimizer. Bear in mind that several thousand screenings may be performed during the optimization of a single parametrized metric, so the size of the optimization target set strongly affects the time required for optimization. The set of actives is divided into three parts: one part forms the query set (for both the optimizer and the statistics calculator), another forms the test set of the optimizer, and the third is used as the test set of the statistics calculator. The user controls this division by defining the percentage of molecules going into each set (e.g. 32% into the query set, 33% into the test set of the optimizer, 35% into the test set of the statistics calculator). The sets are disjoint so that the statistical results are valid: test or target molecules used in the optimization must not be reused in the statistical calculations, so that the statistical test is completely 'blind'.
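
The division described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not ScreeningOptimizer's actual implementation: the function name split_sets, its parameters, rounding down when converting percentages to counts, and giving the rounding remainder to the query set are all hypothetical choices.

```python
import random


def split_sets(targets, actives, n_opt_targets,
               query_pct, opt_test_pct, stat_test_pct, seed=None):
    """Hypothetical sketch: split targets and actives into disjoint random sets."""
    targets = list(targets)
    actives = list(actives)
    if query_pct + opt_test_pct + stat_test_pct != 100:
        raise ValueError("percentages are assumed to sum to 100")
    if not 0 <= n_opt_targets <= len(targets):
        raise ValueError("n_opt_targets must not exceed the number of targets")

    rng = random.Random(seed)
    # Shuffled copies: slicing them afterwards yields disjoint random subsets.
    shuffled_targets = rng.sample(targets, len(targets))
    shuffled_actives = rng.sample(actives, len(actives))

    # Assumption: counts are rounded down; the remainder lands in the query set.
    n_stat_test = len(shuffled_actives) * stat_test_pct // 100
    n_opt_test = len(shuffled_actives) * opt_test_pct // 100

    return {
        "opt_target": shuffled_targets[:n_opt_targets],
        "stat_target": shuffled_targets[n_opt_targets:],
        "stat_test": shuffled_actives[:n_stat_test],
        "opt_test": shuffled_actives[n_stat_test:n_stat_test + n_opt_test],
        "query": shuffled_actives[n_stat_test + n_opt_test:],
    }
```

Because every subset is a non-overlapping slice of a shuffled list, no molecule used in the optimization can reappear in the statistical calculation. For instance, with 500 targets and n_opt_targets set to 50, the remaining 450 targets go to the statistics calculator.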

ScreeningOptimizer can generate more than one random test set from the original sets of targets and actives in the way described above. The user can specify how many sets should be generated, and which of these should be used for optimization and for obtaining statistics. ScreeningOptimizer performs the optimization and statistical calculations on each selected molecule set. During execution, the input molecular files for the optimizer and the statistics calculator are generated automatically, as are the XML output files of the optimizer and the text files containing the statistical results. The naming of these files is described in section Output.

Before execution, the user can choose whether to generate new random test sets or to reuse the existing ones from a previous run (in which case the test sets must exist under the pre-defined naming convention). Reusing the sets allows the user to compare the results of several runs, for which the same test cases are necessary.

 

Usage

Parameters for ScreeningOptimizer can be set only in an XML configuration file. The command line takes only a few general flags; all program setup parameters are taken from the configuration file:

    screeningoptimizer <configuration file> [<general options>]

Note that the configuration file must be the first argument after the screeningoptimizer command on the command line.

Set up the screeningoptimizer script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.

Alternatively, the ScreeningOptimizer class can be directly invoked:

Win32 / Java 2 (assuming that JChem is installed in c:\jchem):

    java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" ^
        chemaxon.descriptors.ScreeningOptimizer ^
        <configuration file> [<general options>]

Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):

    java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
        chemaxon.descriptors.ScreeningOptimizer \
        <configuration file> [<general options>]

Options

Only the general options can be defined on the command line; the rest of the parameters must be specified in the XML configuration file. There are only two general options:

General options: 
  -h, --help               this help message
  -v, --verbose            verbose

Although an example configuration file is available, writing such configuration files manually is discouraged; using the XML configuration editor is highly recommended instead.

The configuration of ScreeningOptimizer consists of three main sections. The first section is Input, which specifies the input and output directories and the file names of the molecular input files. The file names must be specified without a path (in the TargetFile and ActiveFile tags); the path of the input molecular files is specified in the InputDir tag. Output files are written into the directory specified in the OutputDir tag. The generated random selection molecular files are stored in the directory given by RandomSetDir.

The second section is RandomSets. If NewRandomSets is set to true, new random molecule sets are generated and stored in the specified random set directory; otherwise, the already existing files are taken from the same directory. The naming of the generated files is described in section Output.

The number of random sets to be generated is given by the NumberOfRandomSets tag. The division of the target file between the target sets of the optimizer and the statistics calculator is defined by NumberOfTargetTestStructures: the number of molecules specified here goes into the target set of the optimizer, and the rest forms the target set of the statistics calculator. ActiveSetDivision specifies how the set of actives is divided, using the syntax <test in stat>/<test in opt>/<query in opt and stat>. For example, 35/33/32 means that 35% of the molecules go into the test set of the statistics calculator, 33% into the test set of the optimizer, and 32% into the query set. SelectedSets defines which test cases are processed.
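
To make the ActiveSetDivision syntax concrete, the following Python sketch parses such a value into its three percentages. It is not part of ScreeningOptimizer: the function name parse_active_set_division is hypothetical, and both the integer-only values and the requirement that the percentages sum to 100 are assumptions that this document does not state.

```python
def parse_active_set_division(text):
    """Parse '<test in stat>/<test in opt>/<query in opt and stat>'."""
    parts = text.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated values, got {text!r}")
    # Assumption: the values are whole-number percentages.
    stat_test, opt_test, query = (int(part) for part in parts)
    if min(stat_test, opt_test, query) < 0:
        raise ValueError("percentages must not be negative")
    # Assumption: the three percentages must add up to 100.
    if stat_test + opt_test + query != 100:
        raise ValueError("percentages are assumed to sum to 100")
    return {"stat_test": stat_test, "opt_test": opt_test, "query": query}


division = parse_active_set_division("35/33/32")
print(division)  # {'stat_test': 35, 'opt_test': 33, 'query': 32}
```

Note the order of the fields: the query percentage comes last, after the two test set percentages.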

The third section is Configuration, in which the configuration files of OptimizeMetrics and HitStatistics are given with their paths. These are the same configuration files that can be used on the command line of these two tools. The Input section and the OutputFile tags of each descriptor are omitted from the OptimizeMetrics configuration, as these are generated automatically by ScreeningOptimizer. The same applies to the Input and Descriptors sections of the HitStatistics configuration.

Warning! A valid license key is needed to use ScreeningOptimizer. When no valid license key is found in the home directory, ScreeningOptimizer runs in demo mode, in which the number of molecular descriptors processed is limited to 2000 (thus, if several types of molecular descriptors are generated, the number of structures may be limited to a few hundred).

 

Database Connections

Since ScreeningOptimizer does not process a large amount of data, database connections are not yet supported.

 

Input

ScreeningOptimizer accepts molecular structure files as input. The names of these files must be specified in the configuration file. Two input files must be supplied: a file containing the selected target set of molecules and a file containing the actives.

Most molecular file formats are accepted (MDL molfile, Compressed molfile, SDfile, Compressed SDfile, SMILES, etc.).

 

Output

There are several output files, since this program combines the functionality of two other tools. If random molecule set generation is selected, the random molecule files are generated in the directory specified by the user. Their file names are generated in the following way (where i stands for the index of the random test files, starting from 1):

  1. opt-i-<target file name>: the i-th target file for the optimizer
  2. hit-i-<target file name>: the i-th target file for the statistics calculator
  3. opt-query-i-<active file name>: the i-th query file for the optimizer and the statistics calculator
  4. opt-test-i-<active file name>: the i-th test file for the optimizer
  5. hit-i-<active file name>: the i-th test file for the statistics calculator
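
The naming scheme in the list above can be expressed as a short Python sketch. The helper name random_set_file_names and the file names targets.sdf and actives.sdf are hypothetical and serve only as an illustration.

```python
def random_set_file_names(i, target_file, active_file):
    """Return the five random set file names for set index i (starting from 1)."""
    return {
        "optimizer target": f"opt-{i}-{target_file}",
        "statistics target": f"hit-{i}-{target_file}",
        "query": f"opt-query-{i}-{active_file}",
        "optimizer test": f"opt-test-{i}-{active_file}",
        "statistics test": f"hit-{i}-{active_file}",
    }


# Hypothetical input file names, used only for illustration.
for role, name in random_set_file_names(1, "targets.sdf", "actives.sdf").items():
    print(f"{role}: {name}")
```

Such a helper can be handy for checking that all five files of a given index exist in the random set directory before reusing the sets of a previous run.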

The output of the optimization consists of XML files containing the setup of the parametrized metrics, one file for each molecular descriptor type (see also output of OptimizeMetrics). File names are generated as follows:

opt-i-<descriptor type>-<target file name>-<active filename>.xml

where i stands for the index of the processed random test files.

The output of the statistical calculations consists of text files containing statistical information about the performance of the parametrized metrics in table format (see also output of HitStatistics). File names are generated as follows:

i-<target file name>-<active filename>.stat

where i stands for the index of the processed random test files.
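
The two result file name patterns above can be sketched in Python as follows. The function names, the descriptor type label CF and the input file names are hypothetical. This document does not state whether the extension of the input file names is kept in the result names, so the sketch simply inserts whatever strings it is given.

```python
def optimizer_output_name(i, descriptor_type, target_file, active_file):
    """Name of the optimizer's XML output for set i and one descriptor type."""
    return f"opt-{i}-{descriptor_type}-{target_file}-{active_file}.xml"


def statistics_output_name(i, target_file, active_file):
    """Name of the statistics calculator's text output for set i."""
    return f"{i}-{target_file}-{active_file}.stat"


# Hypothetical values, used only for illustration.
print(optimizer_output_name(1, "CF", "targets", "actives"))
print(statistics_output_name(1, "targets", "actives"))
```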

 

Configuration File

Besides the XML configuration file that is used to specify parameter settings (see Options), ScreeningOptimizer takes other configuration files required by the optimizer (see OptimizeMetrics configuration) and the statistics calculator (see HitStatistics configuration). The names of these XML configuration files are specified in the configuration file of ScreeningOptimizer (see Options).

These configuration files can be edited by the Configuration Editor GUI. There are pre-prepared configuration files available in the 'examples/config' directory (see pharmacophore fingerprint configuration, chemical fingerprint configuration, OptimizeMetrics configuration, HitStatistics configuration, ScreeningOptimizer configuration).

 

Examples

  1. A UNIX command (see Notes) using the configuration file screeningoptimizer.xml from the directory config:
    cd examples
    bin/screeningoptimizer config/screeningoptimizer.xml
    
  2. Same as above, only in verbose mode:
    cd examples
    bin/screeningoptimizer config/screeningoptimizer.xml -v
    

    This example is available as an executable example in the 'examples/bin' directory (see screeningoptimizer_example). 500 target molecules are taken from the NCI database, of which 50 are used for optimizing the parameters and the rest for the statistical calculations. The actives are β2 antagonists of the adrenergic receptor, of which 35% are used for statistical testing, 33% form the test set of the optimization, and the rest are used for generating the hypothesis. Five random test sets are generated, and one of them is processed.

Notes

  1. The same examples work on Windows too, though backslashes must be used instead of forward slashes in the commands.
 
Copyright © 1999-2008 ChemAxon Ltd.    All rights reserved.