The purpose of this document is to demonstrate the usage of applications in the Screen package through examples. These examples range from simple to more complex ones and cover many aspects of virtual screening using ChemAxon's tools. Pharmacophore perception, molecular descriptor generation, screening and its optimization are illustrated.
In order to run these examples you need:
Start menu -> All Programs -> Accessories -> Command
Prompt or Start menu -> Run > cmd). Alternatively, you can
use a UNIX-like shell, Cygwin, top.cd examples/screenin UNIX, or
cd examples\screenin Windows.
This example demonstrates the perception of pharmacophore point types of
compounds using the PMapper program.
Input structures are retrieved from an SDfile (nci1000.sdf). Results will be stored in nci1000-pmap.sdf.
This output file will contain all original structures along with their
pharmacophore point map stored in the custom PMAP-FRAG tag.
Pharmacophoric point perception is based on a simple
rule-base which defines six different pharmacophore atom types: anionic,
cationic, hydrogen bond donor, hydrogen bond acceptor, hydrophobic and aromatic.
pmapper nci1000.sdf -c pharma-frag.xml -o nci1000-pmap.sdf -S -t PMAP-FRAG
The use and meaning of command-line flags in the above command:
| Option | Description | Default |
|---|---|---|
-c |
specifies the name of the XML configuration file | - |
-o |
specifies the name of the output file | standard output (terminal windows where pmapper runs) when omitted |
-S |
forces SDfile output | - |
-t |
specifies the name of the SDF tag where pharmacophore data is stored | PMAP |
The pharmacophore types are mapped to individual atoms. When displayed in MarvinView they can be colored according to the schema below:
| pharmacophore type |
display color |
pharmacophore type |
display color |
| cationic | anionic | ||
| aromatic | hydrophobic | ||
| donor | acceptor | ||
| acceptor and donor | anionic and acceptor | ||
| other combination | none |
This schema is defined in the XML configuration file, in the
ScreeningConfiguration
section.
To display structures using pharmacophore type coloring execute:
mview nci1000-pmap.sdf -t PMAP-FRAG -p PF.colors
This command will pop up a MarvinView application window and will display pharmacophore type colored structures.
The use and meaning of command-line flags in the above command:
| Option | Description |
|---|---|
-t |
specifies the name of the SDF tag where pharmacophore data is stored |
-p |
specifies the name of the field containing the coloring schema |
The real point of pharmacophore mapping is not pure visualization but the generation of pharmacophore fingerprints that can be used in virtual screening.
Further examples demonstrating the use of the PMapper tool are found in the corresponding user documentation.
Various molecular descriptors are generated by the GenerateMD program. The complete list of available descriptors can be obtained by:
generatemd -L
This command displays the name and abbreviation of all descriptors:
GenerateMD - Molecular Descriptor Generator 1.5.12, (C) 2002-2005 ChemAxon Ltd. Built in descriptor types: Chemical fingerprint (CF) Pharmacophore fingerprint (PF) BCUT descriptors (BCUT) Hydrogen bond donor/acceptor count (HDon/HAcc) octanol-water distribution coefficient (LogD) octanol-water partition coefficient (LogP) topological polar surface area (TPSA) mass of molecule (Mass) number of heavy atoms (Heavy)
Some examples are given below.
This example demonstrates the generation of topological pharmacophore
fingerprints for 1000 compounds taken from an SDfile (nci1000.sdf). Results
will be stored in nci1000-PF.sdf. This output file will contain every original
structures along with their pharmacophore point map (in the PMAP tag) as well
as the pharmacophore fingerprints generated (in the PF tag).
The progress of the descriptor generation will be reported in the terminal
window after each 100 molecular structures have been processed.
generatemd c nci1000.sdf -k PF -c pharma-frag.xml -o nci1000-pfp.sdf -S -v 100
The use and meaning of command-line flags in the above command:
| Option | Description | Default |
|---|---|---|
c |
is a command abbreviation which stands for 'create' | - |
-k |
specifies the type of the molecular
descriptor to be generated, PF stands for
PharmacophoreFingerprint |
- |
-c |
specifies the name of the XML configuration file | - |
-o |
specifies the name of the output file | standard output, when -o is omitted |
-S |
forces SDfile output | |
-v |
verbose mode, progress is reported after each 100 structures processed | 1000 |
The output file contains two new custom tags after each molecular structure:
> <PMAP> h;h;h;h;a;h;h;h;a > <PF> a a = | 0 0 0 0 1 0 0 0 0 0 |, h a = | 2 4 5 3 0 0 0 0 0 0 |, h h = | 7 8 5 1 0 0 0 0 0 0 |
Such fingerprints can directly be used in virtual screening.
This example demonstrates the generation of topological pharmacophore fingerprints for 1000 compounds taken from an SDfile (nci1000.sdf). Results will be stored in a descriptor file. This output file will not contain original structures only the descriptors generated. The program runs in quiet mode, no message is written about its progress.
generatemd c nci1000.sdf -k CF -c cfp.xml -o nci1000.cfp
The use and meaning of command-line flags in the above command:
| Option | Description |
|---|---|
c |
is a command abbreviation which stands for 'create' |
-k |
specifies the type of the molecular
descriptor to be generated, CF stands for
ChemicalFingerprint |
-c |
specifies the configuration file for the CF
descriptor |
Note, that the XML configuration file is optional for chemical fingerprint generation and it is omitted in the above example.
A portion of the output file is displayed below:

Such fingerprints can directly be used in virtual screening and also in clustering.
This example demonstrates the generation of a scalar descriptor, the topological polar surface area. 1000 compounds are taken from an SDfile (nci1000.sdf). Results will be stored in a descriptor file. This output file will not contain original structures only the descriptors generated. The program runs in quiet mode, no message is written about its progress.
generatemd c nci1000.sdf -k TPSA -o nci1000.tpsa
The use and meaning of command-line flags in the above command:
| Option | Description |
|---|---|
c |
is a command abbreviation which stands for 'create' |
-k |
specifies the type of the molecular
descriptor to be generated, TPSA stands for
Topological Polar Surface Area. |
Note, that an XML configuration file is not needed for the generation of most scalar descriptors.
Further examples demonstrating the use of the GenerateMD tool, including examples when structures are taken from database tables and descriptors are stored in the database too, are found in the corresponding user documentation.
ChemAxon's virtual screening application is called ScreenMD. This program recognizes a wide variety of molecular descriptors and can perform dissimilarity calculation using one or more descriptors at the same time.
This example demonstrates how chemical fingerprints generated in the above example and stored in a molecular descriptor file can be used in searching for compounds that are chemically (topologically) similar to a query compound (stored in a file as a smiles string):

screenmd ace-1.smiles -k nci1000.cfp
ace-1.smiles is the name of the query file which contains the
above displayed structure.
The use and meaning of command-line flags in the above command:
| Option | Description |
|---|---|
-k |
specifies the type of the molecular descriptor to be used, in this example it is not an abbreviation of a descriptor as listed above but the name of a descriptor file generated previously. Note, that such files store the original configuration used when they were generated. This assures that the same parametrization is used in screening too. |
The output is written into the terminal window where the command is executed.
Each row contains two number, the Tanimoto and the Euclidean dissimilarity
ratios between the query and one target structure. Note, that not all
dissimilarity values are displayed, only those that are lower than a predefined
dissimilarity threshold (this is 0.2 for the Tanimoto dissimilarity, and 10 in
the case of the Euclidean distance). If either values are below the threshold,
both are displayed, that is, the corresponding target structure is a virtual
hit.
A portion of the output of the above command is as
follows:
q1_CF_Tan q1_CF_Euc
0.66 9.70
0.48 8.43
0.58 9.49
0.52 9.54
0.52 9.75
0.62 9.80
0.56 9.00
0.61 9.95
0.64 9.54
0.62 10.00
0.61 9.75
0.59 9.38
0.51 9.33
0.49 8.66
0.45 8.72
0.38 8.06
0.49 9.64
0.50 9.22
0.50 9.22
0.50 9.22
0.50 9.22
0.50 9.22
0.50 9.22
0.50 9.22
0.35 7.87
To get a more useful output two more flags need to be used:
screenmd ace-1.smiles -k nci1000.cfp -M Euclidean -g
The use and meaning of these two command-line flags in the above command:
| Option | Description | Default |
|---|---|---|
-M |
selects the specified metric for dissimilarity calculations | all metrics in the configuration |
-g |
generates unique structures id-s and writes them into the output, may take an optional starting value | 1 |
The output written in the command window is:
id q1_CF_Euc 90 9.70 185 8.43 200 9.49 205 9.54 220 9.75 226 9.80 228 9.00 236 9.95 239 9.54 293 10.00 309 9.75 315 9.38 316 9.33 317 8.66 318 8.72 319 8.06 323 9.64 324 9.22 325 9.22 326 9.22 327 9.22 328 9.22 329 9.22 330 9.22 331 7.87 332 9.75
Note, that missing id values correspond to structures that did
not pass similarity screening, their dissimilarity was above the threshold for
acceptance.
This example demonstrates how pharmacophore fingerprints generated in the above example and stored in an SDfile can be used in searching for compounds that have similar pharmacophore as some given query compounds (stored in a file as a smiles strings). Four ACE inhibitors, displayed below, are used as query structures. (Note, that atoms are colored according to their pharmacophore type.)

screenmd nci1000-pfp.sdf ace-1-4.smiles -k PF -c pharma-frag.xml -t PF -g -o table hits.txt -o sdf hits.sdf
nci1000-pfp.sdf is the target file to be screened while
ace-1-4.smiles is the name of the query file which contains the
above displayed four structures.
The use and meaning of command-line flags in the above command:
| Option | Description |
|---|---|
-k |
specifies the type of the molecular descriptor to be used, in this example it is PF that abbreviates pharmacophore fingerprint |
-c |
specifies the XML configuration file associated with the current descriptor type |
-t |
tells screenmd to take the value of the descriptor from the
SDfile tag named here (note, that PF after -t is the
name of the tag and could be anything) |
-g |
generates unique structures id-s and writes them into the output |
-o |
specifies the type and name of the output file, table stands for the table of dissimilarity values (as seen above) while sdf produces a standard SDfile which contains some custom tags to store the dissimilarity values |
Virtual hits can be visualized with the MarvinView application. Dissimilarity
values are displayed below structures. Note, that there are 8 values
corresponding to each structure because four queries and two dissimilarity
metrics were applied.
Hits can displayed by:
mview -t PMAP -p PF.colors -f q1_PF_Tan:q1_PF_Euc:q2_PF_Tan:q2_PF_Euc:q3_PF_Tan:q3_PF_Euc:q4_PF_Tan:q4_PF_Euc hits.sdf
This hit-set contains all target structures that passed the threshold in a comparison with any of the queries by any of the metrics applied. This is the less restrictive behavior of screening and there are command line flags available to change this default mode.

A hypothesis can be build from the above 4 individual queries. The next example illustrates screening against a pharmacophore hypothesis.
screenmd nci1000-pfp.sdf ace-1-4.smiles -k PF -c pharma-frag.xml -t PF -H Median -g -o table hits.txt -o sdf hits.sdf
In this example a target compound is only compared against the hypothesis, though with both metrics applied, thus there are two dissimilarity values stored per hit structures.

Also note, that the number of hits decreases from 143 to 26.
This example demonstrates how various descriptors can be combined in virtual screening. Unlike in the two examples above, where descriptors were generated in separate steps, in this case all descriptors are generated during the virtual screening procedure ("on-the-fly") and are not stored anywhere.
screenmd nci1000.sdf ace-1.smiles -k PF -c pharma-frag.xml -k TPSA -k LogP -o hits.txt -g
In this example three different molecular descriptors, pharmacophore fingerprint, topological polar surface area and octanol-water partition coefficient are combined in a molecular descriptor set. One dissimilarity value is obtained for each target structure-query structure pair which combines individual dissimilarity ratios of component descriptor using Euclidean metric.
The first id-s and dissimilarity scores of the first 25 hits in the output:
id q1 9 23.19 21 7.19 110 6.71 145 6.13 210 25.43 218 1.40 233 0.96 242 25.96 248 6.92 249 11.22 259 13.17 270 6.52 284 2.50 297 11.14 337 24.87 365 2.80 372 2.35 400 1.06 403 28.48 416 4.32 446 12.14 467 2.59 472 15.73 473 4.53 491 9.53
Examples given here illustrate the use of automatic optimization to enhance the efficiency of virtual screening. The optimization tunes various parameter values introduced in parametrized metrics. The available variants of the two standard metrics, Tanimoto and Euclidean are listed below:
Prior to optimization, it is not known which parametrized metric will exhibit
the best performance, thus all metrics are tried to be optimized and the best
ones are chosen for screening. In the examples below the active family is a set
of ACE inhibitors. For training purposes a small subset of the target database
is used, in these examples 300 structures (nci300.smiles). The
known actives are divided into three parts, 4 structures are used as training
set (ace-1-4.smiles) and four others as queries
(ace-5-8.smiles). The third chunk is not used in optimization, it
is preserved for validation purposes.
optimizemetrics nci300.smiles ace-1-4.smiles ace-5-8.smiles -k PF -c pharma-frag.xml -M TanT Tanimoto -M TanA Tanimoto -a -M TanS Tanimoto -s -o pharma-frag-opti.xml -H Median -v
The use and meaning of command-line flags in the above command:
| Option | Description |
|---|---|
-k |
specifies the type of the molecular descriptor to be used, in this example it is PF that abbreviates pharmacophore fingerprint |
-c |
specifies the XML configuration file associated with the current descriptor type; in optimization it is the input configuration, and the result of the optimization process is written elsewhere, this file is not affected |
-M |
specifies a parametrized metric, it takes two mandatory
parameters: the name of the new parametrized metric (e.g. TanT and
TanA in this example), the name of the base metric (e.g.
Tanimoto) (note, that the base metric has to be defined in the
input configuration file specified after the -c flag). Five optional flags may follow the -M flag: -a,
-w, -s and -n, and these
can be combined (though not all combinations are valid, for instance Euclidean
cannot be scaled, see detailed definition
of parametrized metrics). |
-a |
asymmetric metric |
-w |
weighted (Euclidean) metric |
-s |
scaled (Tanimoto) metric |
-n |
normalized (Euclidean) metric |
-o |
specifies the name of the output XML configuration file which contains the entire input configuration plus all optimized parametrized metrics are added |
-H |
a hypothesis build form the individual queries is used in optimization, the default hypothesis type is Minimum |
Note, that if no parameter specific flag (a, w, n, s) is
given after the name and type of a parametrized metric, it is still valid (e.g.
TanT in the example above). In this case parameters are not
introduced and tuned, the basic metric is used but the dissimilarity threshold for
acceptance/rejection is determined (and store in the XML configuration).
Detailed explanation is found in the relevant documentation.
In the case of pharmacophore fingerprints, weighting can be applied
in two distinct forms: per
cell and per feature.
The command line flags are -W and -w respectively.
The optimizemetrics command writes the specified output XML
configuration file. Besides, some simple statistics about the performance of the
metrics optimized are written in the terminal window. These results, however,
have to be validated using an independent test set.
An independent blind test is carried out to validate the selectivity of optimized metrics. For this purpose a yet unused subset of the target structures and active structures is needed, but the same query set can be used.
hitstatistics nci700.smiles ace-9-16.smiles ace-5-8.smiles -o hitstat.txt -e 3 -H Median -b -k PF -c pharma-frag-opti.xml
The use and meaning of command-line flags in the above command:
| Option | Description |
|---|---|
-o |
specifies the name of the output file where the statistics are saved |
-e |
specifies the numeric precision of decimal numbers |
-H |
specifies the hypothesis type, this should match the one used in the optimization step |
-b |
statistics are calculated for each descriptor and metric separately |
-k |
specifies the descriptor type |
-c |
specifies the name of the configuration file, this has to
be the one generated by optimizemetrics |
The output file contains some general information about the input files and detailed statistical data about the performance of basic and optimized parametrized metrics. This file is best viewed in a spreadsheet program.
Each row contains statistics for one parametrized metric. Columns are:
The command above is fairly long, yet it does not exploit the full functionality of OptimizeMetrics. Before optimization it is not know which descriptor and dissimilarity metric combination will provide the best performance. Therefore it is reasonable to optimize for various descriptors and all possible metrics at the same time. These optimization goals do not change too frequently, only does the target molecule set and the set of known active structures. It is very straightforward to create a batch program (in MS-Windows) or a shell script (in UNIX variants) that collects all these common settings and takes the name of the input structure files only.
optimize nci1000.smiles ace.smiles
This command does all necessary preprocessing steps: cuts target file into
two (one part used in optimization, the other part kept for validation),
similarly, it cuts the active file into three (equal) parts using random
picking, then calls the optimizer for three types of molecular descriptors
(chemical fingerprint, pharmacophore fingerprint and fuzzy pharmacophore
fingerprint) with all available parameterized metrics. When optimization is
accomplished, it writes three XML configuration files that contain the optimized
parametrized metrics.
The output configurations are directly fed into the next stage, the validation
of metrics. Appropriate input files are automatically created and selected for
this purpose. Results are written in table format into the file
example_statistics.stat. The most relevant part of this file is
shown here (in a spreadsheet program, for easy viewing).
The highest enrichment value (and also the highest active hit distribution)
is achieved by the Euclideanwa metric (weighted, asymmetric
Euclidean) for pharmacophore fingerprint, thus this descriptor and this metric
should be used in the screening of the entire library for compounds matching the
ACE pharmacophore.