
4. Input Files, Memory Objects & Further Tools

cherhaus edited this page Aug 30, 2013 · 1 revision

Input File Format

The ParaSim file format for the query and reference fingerprints is quite flexible in handling different types of data. It is a tab-delimited plain text format (Windows or Linux style) with one row per structure, containing at least two columns:

  • A unique alphanumeric row/structure identifier
  • The fingerprint bitset encoded in the common Base64 string format

A header line containing the column identifiers and describing the fingerprint type is mandatory. ParaSim uses this fingerprint description to check whether the fingerprint types of the query and reference data sets are identical. Examples are 'FCFP_6' or 'FEATMORGAN_3', as used in the demo files. A more descriptive appended '_BASE64' (or prepended 'BASE64_') is tolerated but not mandatory and is ignored when fingerprint types are compared. The file parser detects the name of the structure identifier column, but this information is not used so far.

For backward compatibility with versions earlier than 0.05, ParaSim also accepts an additional column 'BITCOUNT' or 'POPCOUNT', containing the number of 'on' bits in the respective fingerprint as an integer. If this column is present, it must be column 2 of the file, between the ID and the fingerprint column. Since v0.05, this column is no longer mandatory, which makes it easier to generate ParaSim input files with third-party tools. However, ParaSim files are read faster if a bitcount column is present.

A typical input file therefore looks like the following:

CID     BITCOUNT     FCFP_6_BASE64
68664   52      AwIDARAAAAAAAIAAAAAAAAAEACAABgAAEAAA [...]
68938   56      CxIBCZAAAAIBAAAEAAABAAAggAAABgAAQIBA [...]
[...]

The following is also valid since v0.05 and represents the minimum requirement for running ParaSim:

CID     FCFP_6_BASE64
68664   AwIDARAAAAAAAIAAAAAAAAAEACAABgAAEAAA [...]
68938   CxIBCZAAAAIBAAAEAAABAAAggAAABgAAQIBA [...]
[...]
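
As a minimal sketch of how such rows could be produced with third-party code, the following Python snippet encodes a raw fingerprint bitset as Base64 and counts its 'on' bits. The function names are hypothetical, and the bit order inside each byte is an assumption that is not specified on this page; the bitcount itself does not depend on it.

```python
import base64


def fingerprint_row(identifier, bitset):
    """Return one tab-delimited row: ID, bitcount, Base64 fingerprint.

    `bitset` is the raw fingerprint as bytes (e.g. 128 bytes for 1024 bits).
    """
    # Count the 'on' bits byte by byte.
    popcount = sum(bin(byte).count("1") for byte in bitset)
    encoded = base64.b64encode(bitset).decode("ascii")
    return f"{identifier}\t{popcount}\t{encoded}"


def write_fingerprint_file(path, records, fp_type="FCFP_6"):
    """Write a header line followed by one row per (identifier, bitset) pair."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        # Header: ID column, optional bitcount column, fingerprint type.
        handle.write(f"CID\tBITCOUNT\t{fp_type}_BASE64\n")
        for identifier, bitset in records:
            handle.write(fingerprint_row(identifier, bitset) + "\n")
```

Omitting the BITCOUNT header and the second field of each row would give the two-column variant shown above.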

The size of the fingerprint bitset (and hence of the Base64 string) is not fixed by the format. However, the bitset size has to be the same in the query and the reference fingerprint files; ParaSim checks this when the reference file is loaded.

Since v0.05, ParaSim supports reading at least one additional data column, e.g. to store Smiles strings for direct output together with the calculated similarity coefficients. In fact, parasim.pl and fp2mem.pl read all additional columns, including their headers, and merge them into a single tab-separated column.

Since v0.05, the following is therefore also a valid file format:

CID     BITCOUNT     FCFP_6_BASE64        SMILES
68664   52    AwIDARAAAAAAAIAAAA [...]    O=C(c1oc(c2c1)cccc2)N3CCN(Cc4ccccc4)CC3
68938   56    CxIBCZAAAAIBAAAEAA [...]    CC(NCC1(c2ccccc2)CCN(CC3Oc(c4OC3)cccc4)CC1)=O

As more than one additional data column is tolerated, the following is valid as well:

CID     FCFP_6_BASE64              SMILES           SELECTED
68664   AwIDARAAAAAAAIAAAA [...]   O=C(c1oc [...]   yes
68938   CxIBCZAAAAIBAAAEAA [...]   CC(NCC1( [...]   no

In this case, ParaSim calculates the bitcounts itself and joins the two additional data columns by a tab into a single column. A typical ParaSim output with this file as the reference dataset would therefore be:

QUERY   REFERENCE       DICE         SMILES<tab>SELECTED
71923   68664   0.285714285714286    O=C(c1oc [...]<tab>yes
71360   68938   0.347107438016529    CC(NCC1( [...]<tab>no
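
The column rules described in this section can be sketched in Python as follows. This is not the parser used by parasim.pl or fp2mem.pl; all function names are hypothetical, and treating header names case-insensitively is an assumption.

```python
def normalise_fp_type(name):
    """Strip the optional 'BASE64_' prefix or '_BASE64' suffix from a header name."""
    name = name.upper()  # assumption: header names are compared case-insensitively
    if name.startswith("BASE64_"):
        name = name[len("BASE64_"):]
    if name.endswith("_BASE64"):
        name = name[:-len("_BASE64")]
    return name


def parse_header(line):
    """Return (fingerprint type, index of the fingerprint column)."""
    columns = line.rstrip("\r\n").split("\t")
    # An optional bitcount column must sit between the ID and the fingerprint.
    has_bitcount = len(columns) > 2 and columns[1].upper() in ("BITCOUNT", "POPCOUNT")
    fp_index = 2 if has_bitcount else 1
    return normalise_fp_type(columns[fp_index]), fp_index


def parse_row(line, fp_index):
    """Return (identifier, bitcount or None, Base64 fingerprint, merged extra columns)."""
    fields = line.rstrip("\r\n").split("\t")  # tolerates Windows and Linux line ends
    bitcount = int(fields[1]) if fp_index == 2 else None
    # All columns after the fingerprint are merged into one, joined by tabs.
    extra = "\t".join(fields[fp_index + 1:])
    return fields[0], bitcount, fields[fp_index], extra
```

A row without a bitcount column yields `None` as bitcount, which a caller would then compute from the decoded fingerprint.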

ParaSim files can be either plain text (.txt) or gzip-compressed (.txt.gz). Filename wildcards are expanded to multiple files but need to be quoted. Example query and reference files are packaged together with the ParaSim script itself in the data/ subdirectory.
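
Quoting keeps the shell from expanding the wildcard before the script sees it, so the script can expand the pattern itself. A minimal Python sketch of reading plain and gzip-compressed files through one wildcard pattern could look like this; the function names are hypothetical and the sketch does not reproduce how ParaSim handles the repeated header lines.

```python
import glob
import gzip


def open_fingerprint_file(path):
    """Open a .txt or .txt.gz file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def iter_lines(pattern):
    """Yield the lines of every file matching a wildcard pattern, in sorted file order."""
    for path in sorted(glob.glob(pattern)):
        with open_fingerprint_file(path) as handle:
            for line in handle:
                # Note: each file contributes its own header line.
                yield line.rstrip("\r\n")
```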


Persistent Memory Objects

As a special feature, ParaSim makes use of pre-stored persistent memory objects. This is because, for large data sets, reading of input files from disk becomes the performance-limiting step in comparison to pure calculation times. This is particularly true for repeated queries against the same set(s) of data.

For that purpose, a supporting tool, fp2mem.pl, is available. It reads a reference fingerprint file and stores it persistently in RAM. Memory consumption is about 100 MB per 1 million fingerprints of length 1024. Several memory objects can be stored in parallel; each is identified and addressed by an integer key. fp2mem.pl can also retrieve information about all memory objects stored on a machine and destroy a particular memory object identified by its key.

To use a memory object generated with fp2mem.pl as the reference dataset, pass the ParaSim option -r (which defines the reference set) the keyword mem: combined with the integer key of the object you want to use, e.g. parasim.pl -r mem:7. ParaSim then reads all reference fingerprint information directly from the memory object with key 7, which significantly reduces the time until calculation results are returned.

For creation of a memory object, fp2mem.pl reads a valid ParaSim fingerprint file. Creation is triggered using option -create together with a numeric key which can be selected from a limited range of allowed integer values (default: 0-10) in order to avoid exhaustive consumption of memory.

Information about stored datasets, including the originator, the source file and the fingerprint type, can be reviewed with option -info, either for all datasets or, in combination with an integer key, for one particular dataset. Similarly, options -destroy and -dump, in combination with an integer key, remove a dataset from memory or dump its content to stdout (for debugging/testing only).

It may be useful to have a cron job regularly update a frequently used reference data set in memory. For that purpose, option -force prevents fp2mem.pl from asking for confirmation before overwriting an existing memory object, and option -silent suppresses all progress output.
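
For such an unattended refresh, a small wrapper could assemble the fp2mem.pl call from the options described here. The Python sketch below only builds the argument list; the function name is hypothetical, and the script location and the inclusive key range 0-10 are assumptions based on the defaults mentioned above.

```python
def fp2mem_refresh_command(key, fingerprint_file, script="fp2mem.pl"):
    """Build the argument list for re-creating a memory object without prompts."""
    # Assumption: the default range of allowed keys is 0-10, both ends included.
    if not 0 <= key <= 10:
        raise ValueError("key outside the default range 0-10")
    return [
        "perl", script,
        "-create", str(key),
        "-file", fingerprint_file,
        "-force",   # overwrite an existing object without confirmation
        "-silent",  # suppress progress output, e.g. in cron mails
    ]


if __name__ == "__main__":
    # Print the command instead of running it; a cron wrapper could pass the
    # list to subprocess.run(command, check=True).
    print(" ".join(fp2mem_refresh_command(7, "data/zinc-test-fcfp6.txt")))
```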

fp2mem.pl options summary:

-info [#key]                       Output information about all existing memory objects.
                                   Optionally, output information for one object identified by #key.
-destroy #key                      Destroy the memory object identified by #key.
-dump #key                         For testing only: Dumps the mem object's content to stdout.
-create #key                       Create the memory object identified by #key. Requires option -file.
    -file fingerprints.txt[.gz]    Used together with -create: The file containing the fingerprint data.
                                   Wildcards are expanded but have to be quoted.
-force                             Force deletion or recreation of existing memory object without confirmation.
                                   CAUTION: This will overwrite all existing content of this object!
-silent                            Suppress progress information output for -create or -destroy.
-help/h                            Show this help.

**Technical note:** The integer keys provided by the user are not used as they are but are converted internally to a numerical key which is unique for each machine. The reason is that all ParaSim-related tools need to identify the same memory objects from the same keys, while the keys should not be so simple that they get mixed up with keys potentially used by other applications.


How to use the Tools shipped together with ParaSim

Together with ParaSim, several additional tools are packaged to facilitate the application of ParaSim and to demonstrate possible use cases. Some tools wrap pre-installed third party software for calculation of fingerprints. So, query or reference files for ParaSim can be generated directly from available structure files (SDF or Smiles).

rdkit2parasim.py

This script expects a running installation of Python and RDKit. It converts an SDF or Smiles file (also gz-compressed) into a ParaSim fingerprint input file. If the script's default parameters are used, it requires source and destination filenames as arguments and, in the case of SDF files, the name of a property containing the unique alphanumeric structure IDs. For regular Smiles files containing only two columns without column names, the ID column will be named 'Index' or, if an ID column is not present, the column 'Index' will be generated with numeric IDs. So far, the RDKit implementations of Morgan fingerprints and feature-based Morgan fingerprints with different radii can be generated.

In order to check if the script runs correctly, try

python rdkit2parasim.py pubchem-test.sdf dest.txt -id CID

or

python rdkit2parasim.py pubchem-test.smi dest.txt

The content of file dest.txt should be identical to the provided file pubchem-test-featmorgan3.txt.
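
One way to perform this comparison is a short Python helper that reports the first line at which two files differ. This is only a convenience sketch with a hypothetical function name; any diff tool serves the same purpose.

```python
from itertools import zip_longest


def first_difference(generated, expected):
    """Return the 1-based number of the first differing line, or None if identical."""
    # Binary mode, so differing line endings also count as a difference.
    with open(generated, "rb") as left, open(expected, "rb") as right:
        for number, (line_a, line_b) in enumerate(zip_longest(left, right), start=1):
            if line_a != line_b:
                return number
    return None
```

Calling `first_difference("dest.txt", "pubchem-test-featmorgan3.txt")` would return `None` when both files match.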

Options:

positional arguments:
  source       A valid Smiles string or the path to the source file. Can be a
               .sdf[.gz] or .smi file. Smiles files require to be without
               title column in the format Smiles, Space, ID.
  destination  Path to the destination file. Will be a tabbed .txt[.gz] file

optional arguments:
  -h, --help   show this help message and exit
  -id ID       Name of a property containing a unique structure identifier.
               For Smiles files, an existing ID column will be renamed
               respectively. If not set or set to 'Index', a new numeric
               property 'Index' will be created during runtime.
  -fp FP       RDKit fingerprint to be used. Allowed values: MORGAN_X or
               FEATMORGAN_X with X being an integer > 0. DEFAULT: FEATMORGAN_3
  -l LENGTH    Length of the fingerprint in bits. Must be a multiple of 8.
               DEFAULT: 1024
  -smi         Add an additional Smiles column. Required when generating
               ParaSim reference files if ParaSim output is supposed to
               contain Smiles already.
  -v           Verbose: Print additional status information.
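
The constraints on -fp and -l listed above can be illustrated with a small Python sketch. It is not taken from rdkit2parasim.py; the function names are hypothetical, and accepting lower-case fingerprint names is an assumption.

```python
import re


def parse_fp_option(value):
    """Split e.g. 'FEATMORGAN_3' into (use_features, radius)."""
    # Assumption: the fingerprint name is matched case-insensitively.
    match = re.fullmatch(r"(FEAT)?MORGAN_(\d+)", value.upper())
    if match is None or int(match.group(2)) <= 0:
        raise ValueError("expected MORGAN_X or FEATMORGAN_X with X > 0")
    return match.group(1) is not None, int(match.group(2))


def fingerprint_bytes(length_bits=1024):
    """Validate the fingerprint length and return its size in bytes."""
    if length_bits <= 0 or length_bits % 8 != 0:
        raise ValueError("length must be a positive multiple of 8")
    return length_bits // 8
```

With RDKit installed, values like these could be passed on to `AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=..., useFeatures=...)`; whether the script does exactly this is not stated here.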

Molecule2Parasim.xml

This is a protocol for Pipeline Pilot™. It can be run either by importing it directly into a Pipeline Pilot™ client window or by calling it through another supportive script, simsearch.pl. Therefore, it requires a running Pipeline Pilot™ server (tested with version 8.5) which needs to be accessible via HTTP to be called by parasim.pl. Make sure that you properly set the execution path for anonymous user access to Pipeline Pilot™ protocols in parasim-config.txt. The protocol reads molecules from SDF or Smiles files (also gz-compressed) and converts them to either FCFP or ECFP fingerprints of radius 2, 4, 6, 8, 10 or 12.

In order to check if Pipeline Pilot™ settings are set correctly for access by simsearch.pl, try:

perl simsearch.pl -fp FCFP_6 -q data/pubchem-test.sdf -r data/zinc-test-fcfp6.txt -id CID

For the Smiles input version, the ID column will be named 'Index' or, if an ID column is not present, the column 'Index' will be generated with numeric IDs:

perl simsearch.pl -fp FCFP_6 -q data/pubchem-test.smi -r data/zinc-test-fcfp6.txt

In both cases, output should be:

QUERY   REFERENCE       TANIMOTO        AVG_TANIMOTO
68664   ZINC01914437    0.198019801980198       0.104496307506587
68938   ZINC03774999    0.160377358490566       0.122158970101436
71360   ZINC03775002    0.133333333333333       0.103979492391050
71696   ZINC03774999    0.163636363636364       0.118017086925888
71917   ZINC03774999    0.147368421052632       0.102165139370256
71107   ZINC03774999    0.173076923076923       0.128406853662191
71542   ZINC01914437    0.185185185185185       0.107759423159295
71227   ZINC03774999    0.181818181818182       0.129684949182247
71767   ZINC03775009    0.174418604651163       0.122120643622887
71923   ZINC03774991    0.154761904761905       0.117569042869504
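
For orientation, the TANIMOTO column above and the DICE column shown earlier can be related to the fingerprint bitcounts. The Python sketch below assumes the standard definitions, Tanimoto = c / (a + b - c) and Dice = 2c / (a + b), where a and b are the bitcounts of the two fingerprints and c is the number of bits set in both. It does not reproduce ParaSim's implementation or the AVG_TANIMOTO column, and the function names are hypothetical.

```python
import base64


def popcount(data):
    """Count the 'on' bits in a bytes object."""
    return sum(bin(byte).count("1") for byte in data)


def similarity(query_b64, reference_b64):
    """Return (tanimoto, dice) for two Base64-encoded fingerprints."""
    query = base64.b64decode(query_b64)
    reference = base64.b64decode(reference_b64)
    # Both bitsets must have the same size, as required for ParaSim input.
    if len(query) != len(reference):
        raise ValueError("fingerprints differ in length")
    a, b = popcount(query), popcount(reference)
    c = popcount(bytes(x & y for x, y in zip(query, reference)))
    if a + b == 0:
        return 0.0, 0.0  # convention chosen here for two empty fingerprints
    return c / (a + b - c), 2 * c / (a + b)
```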

parasim-conversion-knime-demo.zip

This example workflow demonstrates how, in principle, ParaSim input files can be generated with the open-source workflow engine KNIME (http://www.knime.org/) using either RDKit or CDK fingerprints. Before using it, make sure you have the required KNIME packages installed.

**Caution:** As the internal calculations applied within KNIME may differ from the implementations in the Perl or Python scripts, fingerprint files generated with KNIME may differ from those generated with the scripts. Therefore, use only query and reference fingerprint files that come from the same source.

simsearch.pl

This is the most powerful supportive tool for ParaSim as it integrates the generation of fingerprint files either with RDKit or with Pipeline Pilot™ and the similarity search done with ParaSim itself. Therefore it allows similarity search against pre-computed reference fingerprint files directly from SDF or Smiles query files.

As a wrapper script, simsearch.pl combines the functionalities and parameter sets of the three wrapped scripts. In addition to the already described ParaSim parameters, additional parameters are required for the fingerprint type to generate (option -fp) and, for SDF files, the input file property which contains the unique structure ID (option -id). For the full list of the combined set of options, use perl simsearch.pl -h.

simsearch.pl accepts SDF and Smiles files, also gz-compressed. For common Smiles files which contain only two columns without column names, one for the Smiles code and one for the ID, the ID data field name needs to be "Data" for use with Pipeline Pilot™ and "_Name" for use with RDKit.

**Initialisation:** If you want to start similarity searches directly from SDF or Smiles files using simsearch.pl, fingerprints and ParaSim input files need to be generated at runtime using either RDKit (through rdkit2parasim.py) or Pipeline Pilot™ (through Molecule2ParaSim.xml). Therefore, the paths to the executables and scripts need to be defined in the paths section of parasim-config.txt.

The functionality check for Pipeline Pilot™ fingerprints was described above. In order to check if it runs correctly for RDKit fingerprints, try:

perl simsearch.pl -fp featmorgan_3 -q data/pubchem-test.sdf -id CID -r data/zinc-test-featmorgan3.txt

Output should be:

QUERY   REFERENCE       TANIMOTO        AVG_TANIMOTO
68664   ZINC03775002    0.181818181818182       0.116428576403331
68938   ZINC03774999    0.146788990825688       0.121462650379503
71696   ZINC03774999    0.168141592920354       0.108370896568424
71360   ZINC03774991    0.125000000000000       0.101284467917619
71542   ZINC01914437    0.228571428571429       0.141443815964918
71917   ZINC01914437    0.135416666666667       0.101810400382934
71227   ZINC03775002    0.191304347826087       0.133582252553683
71107   ZINC03774999    0.216981132075472       0.124678368523578
71767   ZINC03774991    0.116504854368932       0.096029465692885
71923   ZINC03774999    0.144230769230769       0.121297586262549
