BindSweeper User Guide

BindSweeper is a Python CLI tool that automates parameter sweeps for ProteinDJ/RFdiffusion workflows. It allows you to systematically explore different parameter combinations for protein binding analysis and design generation.

Overview

BindSweeper works by:

Reading sweep configuration from YAML files
Generating parameter combinations to test
Creating Nextflow profiles for each combination
Executing the ProteinDJ pipeline with different parameters
Processing and organizing results

Quick Start

Installation

Note: BindSweeper requires a working, configured ProteinDJ installation. If you have not installed ProteinDJ yet, follow the ProteinDJ installation instructions first.

Install uv (if not already installed):

# On Linux/macOS
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or using pip
pip install uv

Install BindSweeper:

# Navigate to the bindsweeper subdirectory inside the ProteinDJ installation directory
cd <path-to-proteindj>/bindsweeper

# Install BindSweeper globally
uv tool install .
cd ..

# Verify installation
bindsweeper --help

Updating BindSweeper:

# Update BindSweeper to the latest version (run from the ProteinDJ installation directory)
git pull
cd bindsweeper && uv tool install --reinstall-package bindsweeper . && cd ..

Basic Usage

Create a sweep configuration file (e.g., sweep.yaml)
Run the sweep:

bindsweeper

Running on WEHI Systems

When running BindSweeper on WEHI systems, use screen to prevent disconnections during long-running jobs:

Start a screen session:
```
screen -S bindsweeper_run
```

Run BindSweeper within the screen session:

cd /path/to/your/work/directory
bindsweeper --config sweep.yaml --output-dir results/

Detach from screen (keeps the job running):
- Press Ctrl+A, then D

Reattach to the screen session:
```
screen -r bindsweeper_run
```
List all screen sessions:
```
screen -ls
```

Terminate a screen session (when job is complete):

# From within the screen session
exit

# Or kill from outside
screen -X -S bindsweeper_run quit

Configuration Files

BindSweeper uses YAML configuration files to define parameter sweeps. Here are the key components:

Basic Structure

mode: binder_denovo  # or binder_foldcond
profile: milton      # Nextflow profile to use

# Parameters that remain constant across all runs
fixed_params:
  input_pdb: "/path/to/protein.pdb"
  rfd_noise_scale: 0.0

# Parameters to sweep across different values
sweep_params:
  parameter_name:
    type: range        # or values
    min: 0.0          # for range type
    max: 1.0
    step: 0.1
  # or
  other_parameter:
    values:           # for values type
      - "value1"
      - "value2"

# Results processing configuration
results_config:
  rank_dirname: results
  results_dirname: best_designs
  csv_filename: best_designs.csv
  output_csv: sweep_results.csv
  pdb_output_dir: sweep_designs
  zip_results: true

Supported Modes

binder_denovo: De novo binder design
binder_foldcond: Fold-conditioned binder design

Parameter Types

Range Parameters

sweep_params:
  rfd_noise_scale:
    type: range
    min: 0.0
    max: 0.1
    step: 0.05
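
As a rough illustration of how a range entry could expand into discrete values, the sketch below assumes the `max` endpoint is inclusive, so the range above would yield 0.0, 0.05 and 0.1. `expand_range` is a hypothetical helper, not part of BindSweeper's API, and BindSweeper's own expansion may treat the upper bound differently. It steps with `Decimal` to avoid floating-point drift.

```python
from decimal import Decimal


def expand_range(minimum, maximum, step):
    """Expand a range spec into a list of floats, treating max as inclusive.

    Hypothetical helper for illustration only; the inclusive upper bound
    is an assumption, not documented BindSweeper behaviour.
    """
    lo, hi, inc = (Decimal(str(v)) for v in (minimum, maximum, step))
    if inc <= 0:
        raise ValueError("step must be positive")
    if hi < lo:
        raise ValueError("max must be >= min")
    values = []
    current = lo
    while current <= hi:
        values.append(float(current))
        current += inc
    return values


noise_scales = expand_range(0.0, 0.1, 0.05)
print(noise_scales)  # [0.0, 0.05, 0.1]
```

Under this assumption the number of combinations contributed by a range is `floor((max - min) / step) + 1`.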

Value List Parameters

sweep_params:
  hotspot_residues:
    values:
      - "A56,A115,A123"
      - "A56,A115"
      - "A56"

Paired Parameters

Paired parameters allow you to sweep multiple parameters in lock-step (zipped), rather than as a Cartesian product. This is useful when parameters are inherently linked — for example, each target PDB has a corresponding MSA file.

sweep_params:
  uncropped_target_pdb:
    values:
      - "input/protein1.pdb"
      - "input/protein2.pdb"
      - "input/protein3.pdb"
    paired_with:
      boltz_msa_path:
        - "input/msas/protein1.a3m"
        - "input/msas/protein2.a3m"
        - "input/msas/protein3.a3m"

Key behaviours:

All lists in paired_with must have the same length as the primary values list
Paired values are zipped (not crossed): the first PDB always runs with the first MSA, etc.
You can pair multiple secondary parameters at once — just add more keys under paired_with
Paired parameters are combined via Cartesian product with any other (non-paired) sweep parameters
A paired parameter cannot also appear as a separate sweep parameter or a fixed parameter
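
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not BindSweeper's implementation: `zip_paired` and its arguments are hypothetical names, and `spec` simply mirrors one `sweep_params` entry after the YAML has been loaded into a dict.

```python
def zip_paired(primary_name, spec, other_sweep_names=(), fixed_names=()):
    """Return one dict per paired row, zipping primary and secondary values.

    Hypothetical helper: `spec` has a `values` list and an optional
    `paired_with` mapping of parameter name -> list of values.
    """
    values = spec["values"]
    paired = spec.get("paired_with", {})
    for name, column in paired.items():
        # Every paired list must match the primary list in length.
        if len(column) != len(values):
            raise ValueError(
                f"{name}: expected {len(values)} values, got {len(column)}"
            )
        # A paired parameter may not also be a sweep or fixed parameter.
        if name in other_sweep_names or name in fixed_names:
            raise ValueError(f"{name} is paired and also defined elsewhere")
    rows = []
    for index, value in enumerate(values):
        row = {primary_name: value}
        for name, column in paired.items():
            row[name] = column[index]  # same position, not crossed
        rows.append(row)
    return rows


spec = {
    "values": ["input/protein1.pdb", "input/protein2.pdb"],
    "paired_with": {
        "boltz_msa_path": ["input/msas/protein1.a3m", "input/msas/protein2.a3m"],
    },
}
rows = zip_paired("uncropped_target_pdb", spec)
for row in rows:
    print(row)
```

Each printed row holds a PDB together with the MSA at the same list position. Adding another key under `paired_with` adds another field to every row without increasing the number of rows.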

Example with paired + unpaired:

With 3 paired targets and 2 noise scale values, BindSweeper generates 3 × 2 = 6 combinations:

sweep_params:
  uncropped_target_pdb:
    values: ["protein1.pdb", "protein2.pdb", "protein3.pdb"]
    paired_with:
      boltz_msa_path: ["protein1.a3m", "protein2.a3m", "protein3.a3m"]
  rfd_noise_scale:
    values: [0.0, 0.1]

This produces:

| Combination | `uncropped_target_pdb` | `boltz_msa_path` | `rfd_noise_scale` |
|---|---|---|---|
| 1 | protein1.pdb | protein1.a3m | 0.0 |
| 2 | protein1.pdb | protein1.a3m | 0.1 |
| 3 | protein2.pdb | protein2.a3m | 0.0 |
| 4 | protein2.pdb | protein2.a3m | 0.1 |
| 5 | protein3.pdb | protein3.a3m | 0.0 |
| 6 | protein3.pdb | protein3.a3m | 0.1 |
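
The same six combinations can be reproduced with `itertools.product`, treating each paired row as a single unit that is crossed with the unpaired values. This is a sketch of the combination logic as described here, not BindSweeper's actual code; the variable names are illustrative.

```python
from itertools import product

# Paired rows are already zipped: each target travels with its own MSA.
paired_rows = [
    {"uncropped_target_pdb": f"protein{i}.pdb", "boltz_msa_path": f"protein{i}.a3m"}
    for i in (1, 2, 3)
]
# Unpaired sweep parameters are crossed with every paired row.
unpaired = {"rfd_noise_scale": [0.0, 0.1]}

names = list(unpaired)
combinations = []
for row, choice in product(paired_rows, product(*unpaired.values())):
    combo = dict(row)
    combo.update(zip(names, choice))
    combinations.append(combo)

for number, combo in enumerate(combinations, start=1):
    print(number, combo)
```

The loop yields 3 × 2 = 6 dictionaries in the same order as the table: both noise scales for protein1, then protein2, then protein3.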

Example Configurations

1. Hotspots Sweep

Test different hotspot combinations for binder design:

mode: binder_denovo
profile: milton

fixed_params:
  design_length: "60-100"
  input_pdb: "./benchmarkdata/5o45_pd-l1.pdb"
  rfd_noise_scale: 1.0
  rfd_ckpt_override: "complex_beta"

  af2_max_pae_interaction: 5
  af2_max_rmsd_binder_bndaln: 1
  af2_min_plddt_overall: 80

sweep_params:
  hotspot_residues:
    values:
      - null
      - "A56"
      - "A56,A115,A123"

2. Noise Scale Sweep

Explore different noise levels:

mode: binder_denovo
profile: milton

fixed_params:
  design_length: "60-100"
  input_pdb: "./benchmarkdata/5o45_pd-l1.pdb"
  hotspot_residues: "A56,A115,A123"
  rfd_ckpt_override: "complex_beta"

sweep_params:
  rfd_noise_scale:
    type: range
    min: 0.0
    max: 0.1
    step: 0.05

3. Scaffold Directory Sweep

Test different scaffold sets:

mode: binder_foldcond
profile: milton

fixed_params:
  input_pdb: "./benchmarkdata/5o45_pd-l1.pdb"
  hotspot_residues: "A56,A115,A123"
  rfd_noise_scale: 0.0

sweep_params:
  rfd_scaffold_dir:
    values:
      - "./binderscaffolds/scaffolds_100_EHEEHE"
      - "./binderscaffolds/scaffolds_100_HEEHE"
      - "./binderscaffolds/scaffolds_100_HHH"
      - "./binderscaffolds/scaffolds_100_HHHH"

4. Multi-Target Paired Sweep

Sweep across multiple targets, each with a corresponding MSA file:

mode: bindcraft_denovo
profile: milton

fixed_params:
  skip_fold_seq: true
  pred_method: "boltz"

sweep_params:
  uncropped_target_pdb:
    values:
      - "input/protein1.pdb"
      - "input/protein2.pdb"
      - "input/protein3.pdb"
    paired_with:
      boltz_msa_path:
        - "input/msas/protein1.a3m"
        - "input/msas/protein2.a3m"
        - "input/msas/protein3.a3m"

Command Line Options

Basic Options

--help: Show help message
--version: Show version information
--debug: Enable debug logging
--dry-run: Print commands without executing them

File and Directory Options

--config PATH: Path to sweep configuration YAML file
--output-dir PATH: Output directory for results
--pipeline-path PATH: Path to the ProteinDJ main.nf file
--nextflow-config PATH: Path to nextflow.config file

Execution Options

--skip-sweep: Skip parameter sweep and only process results
--continue-on-error: Continue if individual parameter sweeps fail
--resume: Add -resume flag to Nextflow commands to use cached tasks where inputs haven't changed
--parallel: Execute parameter combinations in parallel (each with isolated Nextflow cache)
--max-parallel N: Maximum number of parallel Nextflow runs (default: 4)
--quick-test: Run quick test with reduced parameters first
--auto-update: Automatically sync/update dependencies

Automation Options

-y, --yes-to-all: Answer yes to all file confirmation prompts automatically

Usage Examples

Basic Sweep

bindsweeper --config sweep.yaml --output-dir ./results

Quick Test First

bindsweeper --quick-test --config sweep.yaml

Quick Test Design Counts

When using --quick-test, BindSweeper reduces the number of designs to validate configurations efficiently:

Designs per combination: 2 designs
Sequences per design: 2 sequences
Total sequences per combination: 4 sequences

For the standard quick test configurations:

Noise sweep: 3 combinations → 12 total sequences (3 × 4)
Hotspots sweep: 3 combinations → 12 total sequences (3 × 4)
Scaffold sweep: 3 combinations → 12 total sequences (3 × 4)
Multi-dimensional sweep: 4 combinations → 16 total sequences (4 × 4)
Paired target sweep: N paired targets → N × 4 total sequences (e.g., 3 targets → 12 sequences)
Paired + unpaired sweep: N paired × M unpaired → N × M × 4 total sequences

This allows rapid validation of your parameter sweep configuration before committing to a full run with the default number of designs.
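
The totals above follow from simple multiplication, which the short sketch below reproduces. The constants are the quick-test values quoted in this section; the function name is illustrative.

```python
DESIGNS_PER_COMBINATION = 2   # quick-test value from this guide
SEQUENCES_PER_DESIGN = 2      # quick-test value from this guide


def quick_test_sequences(n_combinations):
    """Total sequences generated by a quick test over n combinations."""
    return n_combinations * DESIGNS_PER_COMBINATION * SEQUENCES_PER_DESIGN


print(quick_test_sequences(3))      # 12: noise, hotspot or scaffold sweep
print(quick_test_sequences(4))      # 16: multi-dimensional sweep
print(quick_test_sequences(3 * 2))  # 24: 3 paired targets x 2 unpaired values
```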

Dry Run (Preview Commands)

bindsweeper --dry-run --config sweep.yaml

Debug Mode

bindsweeper --debug --config sweep.yaml

Continue on Errors

bindsweeper --continue-on-error --config sweep.yaml

Resume Interrupted Sweeps

bindsweeper --resume --config sweep.yaml

When using --resume, BindSweeper adds the -resume flag to all Nextflow commands. This enables Nextflow's caching mechanism, which:

Skips tasks that have already completed successfully
Re-runs only tasks where inputs, parameters, or scripts have changed
Automatically detects parameter changes and re-executes affected tasks
Preserves computational resources by avoiding redundant work

Use cases for --resume:

Interrupted runs: Cluster timeouts, manual cancellation, or system failures
Iterative development: Testing bug fixes in later pipeline stages while reusing early stage results
Parameter refinement: Re-running with modified filtering thresholds while keeping expensive fold/sequence generation cached

Important notes:

Nextflow determines what to cache based on task hashes (inputs, scripts, parameters, containers)
If you modify any parameters (fixed or swept), Nextflow will automatically detect this and re-run affected tasks
The .nextflow/cache/ and work/ directories must be preserved for resume to work
Resume works at the task level within each parameter combination, not at the combination level
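
To make the caching idea concrete, here is a toy model of hash-keyed task reuse. It is not Nextflow's implementation and does not call any Nextflow API; `task_hash`, `run_task`, the script name and the container label are all hypothetical. It only shows why an unchanged task is skipped while a task with a changed parameter runs again.

```python
import hashlib
import json


def task_hash(script, inputs, params, container):
    """Derive a stable key from everything that defines a task."""
    payload = json.dumps(
        {"script": script, "inputs": inputs, "params": params, "container": container},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache = {}  # stands in for the on-disk cache and work directories


def run_task(script, inputs, params, container):
    """Return (result, reused); reuse the cached result if the hash matches."""
    key = task_hash(script, inputs, params, container)
    if key in cache:
        return cache[key], True
    result = f"output of {script} with {params}"  # placeholder for real work
    cache[key] = result
    return result, False


_, reused_first = run_task("fold.sh", ["target.pdb"], {"rfd_noise_scale": 0.0}, "rfd")
_, reused_second = run_task("fold.sh", ["target.pdb"], {"rfd_noise_scale": 0.0}, "rfd")
_, reused_changed = run_task("fold.sh", ["target.pdb"], {"rfd_noise_scale": 0.1}, "rfd")
print(reused_first, reused_second, reused_changed)  # False True False
```

In this model, deleting `cache` forces every task to run again, which mirrors why the cache and work directories must be preserved between runs.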

Parallel Execution

# Execute up to 4 combinations in parallel (default)
bindsweeper --parallel --config sweep.yaml

# Control the maximum number of parallel runs
bindsweeper --parallel --max-parallel 8 --config sweep.yaml

# Combine with resume for robust parallel execution
bindsweeper --parallel --resume --max-parallel 6 --config sweep.yaml

When using --parallel, BindSweeper executes multiple parameter combinations concurrently, which:

Runs multiple independent Nextflow pipelines simultaneously
Provides isolated cache directories for each combination (prevents cache conflicts)
Improves overall throughput on systems with available GPU and CPU resources
Leaves each Nextflow run to parallelize its own tasks internally as normal
Maintains proper resource allocation through the cluster scheduler

Benefits of parallel execution:

Faster completion: Leverage multiple GPUs and CPUs concurrently across different parameter combinations
Better resource utilization: Keep GPUs busy while other combinations process CPU tasks
Fault tolerance: Failed combinations don't block others from completing
Natural batching: Combinations complete and release resources as they finish

Use cases for --parallel:

Large parameter sweeps: When testing 8+ parameter combinations
GPU-rich clusters: Systems with multiple GPUs available for concurrent use
Mixed GPU/CPU workloads: Combinations naturally interleave GPU and CPU-intensive stages
Time-sensitive projects: Need results faster than sequential execution allows

Resource considerations:

--max-parallel should be ≤ number of available GPUs to prevent GPU contention
Each combination spawns its own Nextflow session with internal task parallelization
Monitor cluster queue status to ensure combinations get scheduled efficiently
Disk I/O can become a bottleneck with too many parallel runs
Consider available memory: multiple AF2/Boltz runs require significant RAM per GPU

Important notes:

Each combination uses an isolated cache in <output_dir>/.nextflow_cache/
Combinations are truly independent - no shared state or locks
Works seamlessly with --resume flag for robust parallel execution
Log output is captured per combination in respective output directories
Works with --quick-test flag for rapid validation of large parameter sweeps
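
The behaviour described above (a cap on concurrent runs, with failures recorded rather than blocking other combinations) can be sketched with the standard library. This is an illustration of the pattern only, not BindSweeper's code: the function names are hypothetical, and the commands are small Python one-liners standing in for Nextflow invocations.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_combination(name, command):
    """Run one command to completion and return (name, exit code)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return name, completed.returncode


def run_all(commands, max_parallel=4):
    """Run all commands concurrently, at most `max_parallel` at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(run_combination, n, c) for n, c in commands.items()]
        for future in as_completed(futures):
            name, code = future.result()
            results[name] = code  # a non-zero code is recorded, not raised
    return results


# Stand-in commands; a real sweep would launch one pipeline run per entry.
commands = {
    "combo_1": [sys.executable, "-c", "print('ok')"],
    "combo_2": [sys.executable, "-c", "import sys; sys.exit(1)"],
    "combo_3": [sys.executable, "-c", "print('ok')"],
}
results = run_all(commands, max_parallel=2)
print(results)
```

Threads are sufficient here because each worker spends its time waiting on a child process. The failing `combo_2` does not stop `combo_1` or `combo_3` from finishing.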

File Structure

BindSweeper generates organised output directories:

results/
├── combination1_param1_val1_param2_val2/
│   ├── results/
│   ├── best_designs/
│   └── logs/
├── combination2_param1_val3_param2_val4/
│   └── ...
├── sweep_results.csv        # Combined results
├── sweep_designs/          # Best designs from all runs
└── bindsweeper.log        # Execution log

Results Processing

BindSweeper automatically:

Collects results from all parameter combinations
Ranks designs based on metrics
Copies best designs to a central directory
Generates summary CSV files
Optionally creates ZIP archives
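
For inspecting results outside BindSweeper, the collection step can be approximated as below. The sketch assumes each combination directory holds its CSV at `<combination>/best_designs/best_designs.csv`, which is inferred from the `results_config` defaults and the directory tree in this guide and may not match your layout. The function names and the `design` and `score` columns in the demo are placeholders, not real ProteinDJ output columns.

```python
import csv
import tempfile
from pathlib import Path


def collect_results(output_dir, results_dirname="best_designs",
                    csv_filename="best_designs.csv"):
    """Gather rows from every combination's CSV, tagged with the combination name."""
    rows = []
    pattern = f"*/{results_dirname}/{csv_filename}"
    for csv_path in sorted(Path(output_dir).glob(pattern)):
        combination = csv_path.parent.parent.name
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                rows.append({"combination": combination, **row})
    return rows


def write_summary(rows, destination):
    """Write merged rows to one CSV using the union of all column names."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(destination, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Demo on a throwaway directory with two fake combinations.
with tempfile.TemporaryDirectory() as tmp:
    for name, score in [("combination1", "0.9"), ("combination2", "0.8")]:
        folder = Path(tmp) / name / "best_designs"
        folder.mkdir(parents=True)
        (folder / "best_designs.csv").write_text("design,score\nd1," + score + "\n")
    rows = collect_results(tmp)
    write_summary(rows, Path(tmp) / "sweep_results.csv")
    summary_text = (Path(tmp) / "sweep_results.csv").read_text()

print(len(rows))
print(summary_text)
```

Tagging each row with its combination directory keeps the parameter values traceable once rows from different runs are merged into one file.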

Tips and Best Practices

Start with Quick Tests: Use --quick-test to validate configuration
Use Dry Runs: Preview commands with --dry-run before execution
Use Resume for Long Runs: Always use --resume for multi-hour sweeps to recover from interruptions
Leverage Parallel Execution: Use --parallel for large sweeps on GPU-rich clusters to improve throughput
Monitor Resources: Large parameter sweeps can be resource-intensive, especially when running in parallel
Optimize --max-parallel: Set to match available GPUs (e.g., --max-parallel 8 on systems with 8+ GPUs)
Combine Resume and Parallel: Use both --resume --parallel for robust and efficient large-scale sweeps
Organize Results: Use descriptive output directory names
Check Dependencies: Ensure ProteinDJ and required tools are installed
Preserve Cache Directories: Keep .nextflow/ and work/ directories to enable resume functionality
Monitor Parallel Runs: Check cluster queue to ensure combinations are being scheduled appropriately

Troubleshooting

Common Issues

Missing nextflow.config: Ensure the file exists in the current or parent directory
Invalid parameters: Check YAML syntax and parameter names
Resource constraints: Monitor system resources during execution
Path issues: Use absolute paths for input files
Resume not working: Ensure .nextflow/cache/ and work/ directories exist and haven't been cleaned
Unexpected re-execution with resume: This is expected when inputs, parameters, or scripts have changed; Nextflow detects the change and re-runs the affected tasks

Debug Information

Enable debug logging to see detailed execution information:

bindsweeper --debug

Getting Help

For issues or questions:

Check the log files in the output directory
Use --debug flag for detailed information
Review configuration file syntax
Ensure all dependencies are installed

Advanced Usage

Custom Pipeline Paths

bindsweeper --pipeline-path /custom/path/main.nf

Processing Existing Results

bindsweeper --skip-sweep --output-dir existing_results/
