FragmentStatistics creates statistical results from the output of Fragmenter. The simplest usage is to remove duplicate fragments and sort fragments by occurrence, but FragmentStatistics can also sort fragments by molecule activity or other data read from the input molecules and stored together with the generated fragments.
The input of FragmentStatistics is the output of Fragmenter in cxsmiles format with the following fields:
The output of FragmentStatistics is a sorted cxsmiles table with the following data:
Fragments are sorted by activity, which is calculated in the form of a scoring function:
ac^x * (w1*c1 + w2*c2 + ... + wN*cN)
where:
ac is the heavy atom count
w1, w2, ..., wN are the category weights in descending order (default: from +1 to -1, equidistant)
c1, c2, ..., cN are the fragment counts in each category, in descending activity order
x is the exponent of the heavy atom count (default: 1)
If there is no activity data then FragmentStatistics simply removes fragment duplicates
and sorts fragments by ac^x * c1, where c1 is the
fragment count. By default the exponent is 1 and the score is thus
ac * c1.
If there are two activity categories then the default scoring function is
ac*(c1 - c2). If there are three categories then it is
ac*(c1 - c3), because the equidistant default weights are +1, 0 and -1 and the middle category drops out.
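The scoring function can be sketched in a few lines of Python. This is only an illustration of the formula above, not the FragmentStatistics implementation: the function names `default_weights` and `score` are hypothetical, and the weight of 1 for the single-category case is inferred from the no-activity rule described above.

```python
def default_weights(n):
    """Equidistant weights from +1 down to -1 for n categories.

    Assumption: with a single category (no activity data) the weight is 1,
    so that the score reduces to ac^x * c1.
    """
    if n == 1:
        return [1.0]
    step = 2.0 / (n - 1)
    return [1.0 - i * step for i in range(n)]


def score(heavy_atoms, counts, exponent=1.0, weights=None):
    """ac^x * (w1*c1 + ... + wN*cN); counts are in descending activity order."""
    if weights is None:
        weights = default_weights(len(counts))
    weighted = sum(w * c for w, c in zip(weights, counts))
    return heavy_atoms ** exponent * weighted


print(score(10, [4]))        # no activity data: 10 * 4
print(score(10, [4, 1]))     # two categories: 10 * (4 - 1)
print(score(10, [4, 7, 1]))  # three categories: 10 * (4 - 1), middle count ignored
```

In the three-category call the middle count of 7 has no effect on the result, which is why the default formula contains only c1 and c3.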
Examples
Continuous activity values with a single cutoff of 0.5:
command line: -c "0.5"
scale: values < 0.5 are Inactive, values >= 0.5 are Active
| Name | Activity value | Weight |
|---|---|---|
| Active | >= 0.5 | +1 |
| Inactive | < 0.5 | -1 |
Scoring formula:
ac * (#(Active) - #(Inactive))
Discrete activity values given as a range:
command line: -r "PIC NAN MIC MIL LESS INA"
| Name | Activity value | Weight |
|---|---|---|
| Picomolar inhibitor | PIC | +1 |
| Nanomolar inhibitor | NAN | +0.6 |
| Micromolar inhibitor | MIC | +0.2 |
| Millimolar inhibitor | MIL | -0.2 |
| Less than millimolar | LESS | -0.6 |
| Inactive | INA | -1 |
Scoring formula:
ac * (#(Picomolar) + 0.6 * #(Nanomolar) + 0.2 * #(Micromolar)
- 0.2 * #(Millimolar) - 0.6 * #(Less than millimolar) - #(Inactive))
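As a quick check of the table above, the following sketch recomputes the equidistant weights for the six categories and evaluates the formula for one fragment. The occurrence counts and the heavy atom count are made up for illustration and do not come from any real data set.

```python
labels = ["PIC", "NAN", "MIC", "MIL", "LESS", "INA"]

# Equidistant default weights from +1 to -1, rounded to hide float noise.
n = len(labels)
weights = [round(1 - 2 * i / (n - 1), 10) for i in range(n)]
print(dict(zip(labels, weights)))

# Hypothetical occurrence counts of one fragment in each category.
counts = {"PIC": 3, "NAN": 2, "MIC": 0, "MIL": 1, "LESS": 0, "INA": 2}
heavy_atoms = 8  # hypothetical heavy atom count (ac)

weighted_sum = sum(w * counts[label] for label, w in zip(labels, weights))
print(heavy_atoms * weighted_sum)  # 8 * (3 + 1.2 - 0.2 - 2), up to rounding
```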
A set of working examples is also available.
fragstat [<options>] [<input file>]
Set up the fragstat script or batch file
as described in Preparing the Usage of JChem
Batch Files and Shell Scripts.
Alternatively, the FragmentStatistics class can be directly invoked:
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" ^
chemaxon.reaction.FragmentStatistics ^
[<options>] [<input files/strings>]
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.reaction.FragmentStatistics \
[<options>] [<input files/strings>]
| Option | Description |
|---|---|
| -h, --help | this help message |
| -s, --help-scoring | help on fragment scoring |
| -c, --cutoffs <cutoffs> | category classes with continuous range; cutoffs: a list of cutoff values defining activity intervals (e.g. "0.5 2.3 5.3") |
| -r, --range <range> | category classes with discrete range; range: a list of all possible values in ascending activity order (e.g. "0 1 2 3") |
| -t, --order-type <order> | category activity order; a: ascending (the smaller the more active), d: descending (the bigger the more active) (default: d) |
| -a, --all-fragments | display all fragments in output (default: only actives in highest category) |
| -m, --min-atom-count | minimum number of heavy atoms required in a fragment (default: 0 - no restriction) |
| -e, --default-activity <value> | default activity value set for fragments with no activity value (default: skip these fragments) |
| -d, --display-header | display table header in output (otherwise displays only cxsmiles) |
| -o, --output <filepath> | output file path (default: stdout) |
FragmentStatistics takes its input from the cxsmiles output of
Fragmenter and sorts fragments
with duplicate filtering. If the fragmented molecules contain activity data
in an SDF field then Fragmenter can be run with
the --statistics parameter
to store this data in the created fragments. FragmentStatistics can then be used
to sort fragments by activity measured by a scoring function
(for help on scoring and its parameters, type fragstat -s).
Activity categories are determined either by cutoff values specified in the
--cutoffs parameter or else by the complete activity range specified
in the --range parameter.
In the former case the cutoff values determine a finite set of activity intervals
(the first and the last intervals are half-infinite).
In the latter case the activity range is finite and the complete set of activity
values is listed in the --range parameter.
The command line parameter --order-type determines whether higher
activity is expressed by larger or by smaller numerical values. By default,
FragmentStatistics takes bigger values as more active. This is appropriate when the
activity data measures the activity itself. Specify the opposite order (-t a) if
the activity is given as the minimal concentration needed to achieve some chemical effect.
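How cutoff values and the order type together determine a category can be sketched as follows. This is an illustrative reading of the description above, not the tool's code: the function name `category_index` is hypothetical, and the rule that a value equal to a cutoff belongs to the upper interval is taken from the 0.5 example (>= 0.5: Active) and assumed to hold for both order types.

```python
from bisect import bisect_right


def category_index(value, cutoffs, order="d"):
    """Return 0 for the most active category, len(cutoffs) for the least active.

    cutoffs must be sorted in ascending order; N cutoffs give N + 1 intervals,
    the first and last of which are half-infinite.
    order "d": bigger values are more active; "a": smaller values are more active.
    """
    # Interval number counted from the lowest values (0) to the highest.
    interval = bisect_right(cutoffs, value)
    if order == "d":
        return len(cutoffs) - interval
    return interval


print(category_index(0.5, [0.5]))                  # on the cutoff: most active (0)
print(category_index(0.4, [0.5]))                  # below the cutoff: inactive (1)
print(category_index(3.0, [0.5, 2.3, 5.3]))        # third interval, descending order
print(category_index(3.0, [0.5, 2.3, 5.3], "a"))   # same value, ascending order
```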
If the command line parameter --all-fragments is specified then the output
contains all fragments, otherwise only actives appearing in the highest activity category
are included in the output. If there are no activity categories then all fragments are
regarded as active.
If the command line parameter --min-atom-count is specified then
fragments with fewer heavy atoms than this limit are excluded from the statistics.
If a default activity value is specified in the command line parameter
--default-activity then this activity value is set for fragments with no activity
value. If this parameter is omitted then fragments with no activity value are skipped.
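The combined effect of --min-atom-count and --default-activity can be pictured with a small filter. The record layout used here, a tuple of SMILES string, heavy atom count and activity value, is a simplification assumed for this sketch and is not the actual field layout of the Fragmenter output; the function name `select_fragments` is likewise hypothetical.

```python
def select_fragments(records, min_atom_count=0, default_activity=None):
    """Filter (smiles, heavy_atom_count, activity_or_None) tuples."""
    kept = []
    for smiles, heavy_atoms, activity in records:
        if heavy_atoms < min_atom_count:
            continue  # too small: excluded from the statistics
        if activity is None:
            if default_activity is None:
                continue  # no activity value and no default: skipped
            activity = default_activity
        kept.append((smiles, heavy_atoms, activity))
    return kept


records = [("CC", 2, "1.5"), ("c1ccccc1", 6, None), ("CCO", 3, "0.2")]
print(select_fragments(records, min_atom_count=3))
print(select_fragments(records, min_atom_count=3, default_activity="0"))
```

In the first call only the third record survives; in the second call the benzene record is kept as well, with the default activity filled in.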
If the command line parameter --display-header is specified then the output
table header is included in the output. This is useful when the output is read as a data
table, but in this case the output is not a cxsmiles molecule file and cannot be opened with mview.
The input is the cxsmiles output of Fragmenter with specific fields.
If no input file name or input string is specified in the command line then input is taken from the standard input.
FragmentStatistics writes output molecules in cxsmiles format
with specific fields and an optional header line.
If --output is omitted, results are written to the standard output.
In the examples below, we first fragment the input molecules in
mols.sdf where activity data is given in the
ACTIVITY SDF field:
[Figure: the input molecules in mols.sdf with their ACTIVITY values]
We use Fragmenter.xml as the Fragmenter configuration. Note that we generate all fragments: there is no limit on the number of fragment sets or fragments per fragment set.
Fragment the molecules and copy the ACTIVITY
SDF field to fragments.cxsmiles:
fragment -c Fragmenter.xml -s ACTIVITY mols.sdf -o fragments.cxsmiles
[Figure: some of the generated fragments]
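Remove duplicate fragments and sort with the default scoring function, without activity categories:
fragstat fragments.cxsmiles -o sorted1.cxsmiles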
[Figure: the first four fragments in sorted1.cxsmiles]
Observe that big fragments take precedence.
To change this, adjust the exponent x of the heavy atom count in the ac^x * c1 scoring function.
Set the exponent in the -x parameter:
fragstat fragments.cxsmiles -x 0.4 -o sorted2.cxsmiles
[Figure: the first four fragments in sorted2.cxsmiles]
Observe that smaller fragments with large occurrence are taken first.
Sort by occurrence alone (score c1)
by setting this exponent to 0:
fragstat fragments.cxsmiles -x 0 -o sorted3.cxsmiles
[Figure: the first four fragments in sorted3.cxsmiles]
Observe that fragments with large occurrence are taken first irrespective of their heavy atom counts.
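The effect of the exponent on the ordering can be reproduced with a toy calculation. The three fragments below are invented for illustration (they are not taken from mols.sdf), and the helper name `ranking` is hypothetical; the score is ac^x * c1 as in the examples above.

```python
# Hypothetical fragments: (name, heavy atom count, occurrence count)
fragments = [("large", 20, 3), ("medium", 8, 5), ("small", 3, 6)]


def ranking(exponent):
    """Names sorted by ac^exponent * c1, highest score first."""
    scored = sorted(
        fragments,
        key=lambda f: f[1] ** exponent * f[2],
        reverse=True,
    )
    return [name for name, _, _ in scored]


for x in (1, 0.4, 0):
    print(x, ranking(x))
```

With x = 1 the largest fragment comes first, with x = 0.4 the medium-sized but more frequent fragment overtakes it, and with x = 0 the order follows the occurrence counts alone.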
Set the -m parameter to 3. In this way we sort fragments
by occurrence while skipping fragments with 1-2 heavy atoms:
fragstat fragments.cxsmiles -x 0 -m 3 -o sorted4.cxsmiles
[Figure: the first four fragments in sorted4.cxsmiles]
2 categories, separated by a single cutoff value of 1:
fragstat fragments.cxsmiles -c 1 -o stat1.cxsmiles
[Figure: the first four fragments in stat1.cxsmiles]
3 categories:
only molecule 3 (bopindolol)
is in the active category, only molecule 4 (bornaprolol)
is in the inactive category, and the others are in between:
fragstat fragments.cxsmiles -c "1 40" -o stat2.cxsmiles
[Figure: the first four fragments in stat2.cxsmiles]
Note that multiple cutoff values should be enclosed in quotes.
4 categories,
each molecule being in a separate category; specify a discrete range:
fragstat fragments.cxsmiles -r "0.05 4 5 50" -o stat3.cxsmiles
[Figure: the first four fragments in stat3.cxsmiles]
Note that multiple range values should be enclosed in quotes. With a discrete range, category values are matched by exact string matching, so activity values can also be letters or other non-numerical strings.