# BindSweeper User Guide
BindSweeper is a Python CLI tool that automates parameter sweeps for ProteinDJ/RFdiffusion workflows. It allows you to systematically explore different parameter combinations for protein binding analysis and design generation.
BindSweeper works by:
- Reading sweep configuration from YAML files
- Generating parameter combinations to test
- Creating Nextflow profiles for each combination
- Executing the ProteinDJ pipeline with different parameters
- Processing and organizing results
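The steps above can be sketched as a small driver: expand the sweep into concrete parameter combinations, then build one pipeline invocation per combination. A minimal illustration in Python — the sweep values and the exact Nextflow command line are assumptions for illustration, not BindSweeper's internals:

```python
import itertools

def expand_combinations(sweep_params):
    """Cartesian product over swept parameter values (illustrative only)."""
    names = list(sweep_params)
    for combo in itertools.product(*(sweep_params[n] for n in names)):
        yield dict(zip(names, combo))

# A hypothetical sweep: 2 noise scales x 2 hotspot sets -> 4 pipeline runs
sweep = {
    "rfd_noise_scale": [0.0, 0.1],
    "hotspot_residues": ["A56", "A56,A115"],
}

# Build one command per combination (the real invocation BindSweeper
# constructs may differ)
commands = [
    ["nextflow", "run", "main.nf", "-profile", "milton"]
    + [arg for key, val in combo.items() for arg in (f"--{key}", str(val))]
    for combo in expand_combinations(sweep)
]
print(len(commands))  # 4
```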
Please note: BindSweeper requires that ProteinDJ is installed and configured. If you have not installed ProteinDJ yet, follow the ProteinDJ installation instructions first.
- Install uv (if not already installed):

  ```bash
  # On Linux/macOS
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # Or using pip
  pip install uv
  ```
- Install BindSweeper:

  ```bash
  # Navigate to the bindsweeper subdirectory inside the ProteinDJ installation directory
  cd <path-to-proteindj>/bindsweeper

  # Install BindSweeper globally
  uv tool install .
  cd ..

  # Verify installation
  bindsweeper --help
  ```
- Update BindSweeper:

  ```bash
  # To update bindsweeper to the latest version, use the following commands
  git pull
  cd bindsweeper && uv tool install --reinstall-package bindsweeper . && cd ..
  ```
- Create a sweep configuration file (e.g., `sweep.yaml`)
- Run the sweep:

  ```bash
  bindsweeper
  ```

When running BindSweeper on WEHI systems, use `screen` to prevent disconnections during long-running jobs:
- Start a screen session:

  ```bash
  screen -S bindsweeper_run
  ```

- Run BindSweeper within the screen session:

  ```bash
  cd /path/to/your/work/directory
  bindsweeper --config sweep.yaml --output-dir results/
  ```

- Detach from the screen session (keeps the job running): press `Ctrl+A`, then `D`

- Reattach to the screen session:

  ```bash
  screen -r bindsweeper_run
  ```

- List all screen sessions:

  ```bash
  screen -ls
  ```

- Terminate a screen session (when the job is complete):

  ```bash
  # From within the screen session
  exit

  # Or kill it from outside
  screen -X -S bindsweeper_run quit
  ```
BindSweeper uses YAML configuration files to define parameter sweeps. Here are the key components:
```yaml
mode: binder_denovo   # or binder_foldcond
profile: milton       # Nextflow profile to use

# Parameters that remain constant across all runs
fixed_params:
  input_pdb: "/path/to/protein.pdb"
  rfd_noise_scale: 0.0

# Parameters to sweep across different values
sweep_params:
  parameter_name:
    type: range       # or values
    min: 0.0          # for range type
    max: 1.0
    step: 0.1
  # or
  other_parameter:
    values:           # for values type
      - "value1"
      - "value2"

# Results processing configuration
results_config:
  rank_dirname: results
  results_dirname: best_designs
  csv_filename: best_designs.csv
  output_csv: sweep_results.csv
  pdb_output_dir: sweep_designs
  zip_results: true
```

Available modes:

- `binder_denovo`: De novo binder design
- `binder_foldcond`: Fold-conditioned binder design
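As an illustration of how a `type: range` spec might expand into concrete values, here is a sketch that assumes the endpoint is inclusive and guards against floating-point drift (BindSweeper's actual expansion logic may differ):

```python
def expand_range(min_val, max_val, step):
    """Expand a {type: range} spec into a list of values.
    Assumes max is inclusive; rounds to dodge float accumulation error."""
    values, v = [], min_val
    while v <= max_val + 1e-9:
        values.append(round(v, 10))
        v += step
    return values

print(expand_range(0.0, 0.1, 0.05))  # [0.0, 0.05, 0.1]
```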
A `range` sweep generates evenly spaced values:

```yaml
sweep_params:
  rfd_noise_scale:
    type: range
    min: 0.0
    max: 0.1
    step: 0.05
```

A `values` sweep lists the values explicitly:

```yaml
sweep_params:
  hotspot_residues:
    values:
      - "A56,A115,A123"
      - "A56,A115"
      - "A56"
```

Paired parameters allow you to sweep multiple parameters in lock-step (zipped), rather than as a Cartesian product. This is useful when parameters are inherently linked, for example when each target PDB has a corresponding MSA file.
```yaml
sweep_params:
  uncropped_target_pdb:
    values:
      - "input/protein1.pdb"
      - "input/protein2.pdb"
      - "input/protein3.pdb"
    paired_with:
      boltz_msa_path:
        - "input/msas/protein1.a3m"
        - "input/msas/protein2.a3m"
        - "input/msas/protein3.a3m"
```

Key behaviours:
- All lists in `paired_with` must have the same length as the primary `values` list
- Paired values are zipped (not crossed): the first PDB always runs with the first MSA, etc.
- You can pair multiple secondary parameters at once; just add more keys under `paired_with`
- Paired parameters are combined via Cartesian product with any other (non-paired) sweep parameters
- A paired parameter cannot also appear as a separate sweep parameter or a fixed parameter
Example with paired and unpaired parameters. With 3 paired targets and 2 noise scale values, BindSweeper generates 3 × 2 = 6 combinations:

```yaml
sweep_params:
  uncropped_target_pdb:
    values: ["protein1.pdb", "protein2.pdb", "protein3.pdb"]
    paired_with:
      boltz_msa_path: ["protein1.a3m", "protein2.a3m", "protein3.a3m"]
  rfd_noise_scale:
    values: [0.0, 0.1]
```

This produces:
| Combination | uncropped_target_pdb | boltz_msa_path | rfd_noise_scale |
|---|---|---|---|
| 1 | protein1.pdb | protein1.a3m | 0.0 |
| 2 | protein1.pdb | protein1.a3m | 0.1 |
| 3 | protein2.pdb | protein2.a3m | 0.0 |
| 4 | protein2.pdb | protein2.a3m | 0.1 |
| 5 | protein3.pdb | protein3.a3m | 0.0 |
| 6 | protein3.pdb | protein3.a3m | 0.1 |
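The zip-then-cross behaviour in the table above can be reproduced in a few lines of Python (a sketch of the semantics, not BindSweeper's internals):

```python
import itertools

# Paired values are zipped together first...
paired = list(zip(
    ["protein1.pdb", "protein2.pdb", "protein3.pdb"],  # uncropped_target_pdb
    ["protein1.a3m", "protein2.a3m", "protein3.a3m"],  # boltz_msa_path
))
# ...then crossed with every unpaired sweep parameter
noise = [0.0, 0.1]
combos = [
    {"uncropped_target_pdb": pdb, "boltz_msa_path": msa, "rfd_noise_scale": n}
    for (pdb, msa), n in itertools.product(paired, noise)
]
print(len(combos))  # 3 paired targets x 2 noise values = 6
```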
Test different hotspot combinations for binder design:
```yaml
mode: binder_denovo
profile: milton
fixed_params:
  design_length: "60-100"
  input_pdb: "./benchmarkdata/5o45_pd-l1.pdb"
  rfd_noise_scale: 1.0
  rfd_ckpt_override: "complex_beta"
  af2_max_pae_interaction: 5
  af2_max_rmsd_binder_bndaln: 1
  af2_min_plddt_overall: 80
sweep_params:
  hotspot_residues:
    values:
      - null
      - "A56"
      - "A56,A115,A123"
```

Explore different noise levels:
```yaml
mode: binder_denovo
profile: milton
fixed_params:
  design_length: "60-100"
  input_pdb: "./benchmarkdata/5o45_pd-l1.pdb"
  hotspot_residues: "A56,A115,A123"
  rfd_ckpt_override: "complex_beta"
sweep_params:
  rfd_noise_scale:
    type: range
    min: 0.0
    max: 0.1
    step: 0.05
```

Test different scaffold sets:
```yaml
mode: binder_foldcond
profile: milton
fixed_params:
  input_pdb: "./benchmarkdata/5o45_pd-l1.pdb"
  hotspot_residues: "A56,A115,A123"
  rfd_noise_scale: 0.0
sweep_params:
  rfd_scaffold_dir:
    values:
      - "./binderscaffolds/scaffolds_100_EHEEHE"
      - "./binderscaffolds/scaffolds_100_HEEHE"
      - "./binderscaffolds/scaffolds_100_HHH"
      - "./binderscaffolds/scaffolds_100_HHHH"
```

Sweep across multiple targets, each with a corresponding MSA file:
```yaml
mode: bindcraft_denovo
profile: milton
fixed_params:
  skip_fold_seq: true
  pred_method: "boltz"
sweep_params:
  uncropped_target_pdb:
    values:
      - "input/protein1.pdb"
      - "input/protein2.pdb"
      - "input/protein3.pdb"
    paired_with:
      boltz_msa_path:
        - "input/msas/protein1.a3m"
        - "input/msas/protein2.a3m"
        - "input/msas/protein3.a3m"
```

General options:

- `--help`: Show help message
- `--version`: Show version information
- `--debug`: Enable debug logging
- `--dry-run`: Print commands without executing them
Path options:

- `--config PATH`: Path to the sweep configuration YAML file
- `--output-dir PATH`: Output directory for results
- `--pipeline-path PATH`: Path to the ProteinDJ `main.nf` file
- `--nextflow-config PATH`: Path to the `nextflow.config` file

Execution options:

- `--skip-sweep`: Skip the parameter sweep and only process results
- `--continue-on-error`: Continue if individual parameter combinations fail
- `--resume`: Add the `-resume` flag to Nextflow commands to reuse cached tasks where inputs haven't changed
- `--parallel`: Execute parameter combinations in parallel (each with an isolated Nextflow cache)
- `--max-parallel N`: Maximum number of parallel Nextflow runs (default: 4)
- `--quick-test`: Run a quick test with reduced parameters first
- `--auto-update`: Automatically sync/update dependencies
- `-y, --yes-to-all`: Automatically answer yes to file confirmation prompts
Run a sweep:

```bash
bindsweeper --config sweep.yaml --output-dir ./results
```

Run a quick test first:

```bash
bindsweeper --quick-test --config sweep.yaml
```

When using `--quick-test`, BindSweeper reduces the number of designs to validate configurations efficiently:
- Designs per combination: 2
- Sequences per design: 2
- Total sequences per combination: 4
For the standard quick test configurations:
- Noise sweep: 3 combinations → 12 total sequences (3 × 4)
- Hotspots sweep: 3 combinations → 12 total sequences (3 × 4)
- Scaffold sweep: 3 combinations → 12 total sequences (3 × 4)
- Multi-dimensional sweep: 4 combinations → 16 total sequences (4 × 4)
- Paired target sweep: N paired targets → N × 4 total sequences (e.g., 3 targets → 12 sequences)
- Paired + unpaired sweep: N paired × M unpaired → N × M × 4 total sequences
This allows rapid validation of your parameter sweep configuration before committing to a full run with the default number of designs.
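The quick-test totals above are just combinations × designs × sequences, which can be checked directly:

```python
DESIGNS_PER_COMBINATION = 2   # quick-test setting
SEQS_PER_DESIGN = 2           # quick-test setting

def quick_test_sequences(n_combinations):
    """Total sequences generated by a quick test over n combinations."""
    return n_combinations * DESIGNS_PER_COMBINATION * SEQS_PER_DESIGN

print(quick_test_sequences(3))  # noise/hotspot/scaffold sweeps: 12
print(quick_test_sequences(4))  # multi-dimensional sweep: 16
```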
```bash
# Preview commands without executing them
bindsweeper --dry-run --config sweep.yaml

# Enable debug logging
bindsweeper --debug --config sweep.yaml

# Continue past failed combinations
bindsweeper --continue-on-error --config sweep.yaml

# Reuse cached Nextflow tasks
bindsweeper --resume --config sweep.yaml
```

When using `--resume`, BindSweeper adds the `-resume` flag to all Nextflow commands. This enables Nextflow's caching mechanism, which:
- Skips tasks that have already completed successfully
- Re-runs only tasks where inputs, parameters, or scripts have changed
- Automatically detects parameter changes and re-executes affected tasks
- Preserves computational resources by avoiding redundant work
Use cases for `--resume`:
- Interrupted runs: Cluster timeouts, manual cancellation, or system failures
- Iterative development: Testing bug fixes in later pipeline stages while reusing early stage results
- Parameter refinement: Re-running with modified filtering thresholds while keeping expensive fold/sequence generation cached
Important notes:
- Nextflow determines what to cache based on task hashes (inputs, scripts, parameters, containers)
- If you modify any parameters (fixed or swept), Nextflow will automatically detect this and re-run affected tasks
- The `.nextflow/cache/` and `work/` directories must be preserved for resume to work
- Resume works at the task level within each parameter combination, not at the combination level
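Nextflow's cache decisions are driven by task hashes. The idea can be illustrated with a toy content-addressed cache; this is purely a sketch of the concept, not Nextflow's implementation:

```python
import hashlib
import json

def task_hash(script: str, inputs: dict, params: dict) -> str:
    """Hash everything that defines a task; any change yields a new hash."""
    payload = json.dumps({"script": script, "inputs": inputs, "params": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

cache = {}

def run_task(script, inputs, params):
    h = task_hash(script, inputs, params)
    if h in cache:                 # unchanged task: reuse cached result
        return cache[h], True
    result = f"ran {script}"       # stand-in for actual execution
    cache[h] = result
    return result, False

_, cached = run_task("fold.sh", {"pdb": "a.pdb"}, {"noise": 0.0})
print(cached)  # False: first run executes
_, cached = run_task("fold.sh", {"pdb": "a.pdb"}, {"noise": 0.0})
print(cached)  # True: identical task is skipped
_, cached = run_task("fold.sh", {"pdb": "a.pdb"}, {"noise": 0.1})
print(cached)  # False: changed parameter forces a re-run
```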
```bash
# Execute up to 4 combinations in parallel (default)
bindsweeper --parallel --config sweep.yaml

# Control the maximum number of parallel runs
bindsweeper --parallel --max-parallel 8 --config sweep.yaml

# Combine with resume for robust parallel execution
bindsweeper --parallel --resume --max-parallel 6 --config sweep.yaml
```

When using `--parallel`, BindSweeper executes multiple parameter combinations concurrently, which:
- Runs multiple independent Nextflow pipelines simultaneously
- Provides isolated cache directories for each combination (prevents cache conflicts)
- Improves overall throughput on systems with available GPU and CPU resources
- Each Nextflow run still internally parallelizes tasks as normal
- Maintains proper resource allocation through the cluster scheduler
Benefits of parallel execution:
- Faster completion: Leverage multiple GPUs and CPUs concurrently across different parameter combinations
- Better resource utilization: Keep GPUs busy while other combinations process CPU tasks
- Fault tolerance: Failed combinations don't block others from completing
- Natural batching: Combinations complete and release resources as they finish
Use cases for `--parallel`:
- Large parameter sweeps: When testing 8+ parameter combinations
- GPU-rich clusters: Systems with multiple GPUs available for concurrent use
- Mixed GPU/CPU workloads: Combinations naturally interleave GPU and CPU-intensive stages
- Time-sensitive projects: Need results faster than sequential execution allows
Resource considerations:
- `--max-parallel` should be ≤ the number of available GPUs to prevent GPU contention
- Each combination spawns its own Nextflow session with internal task parallelization
- Monitor cluster queue status to ensure combinations get scheduled efficiently
- Disk I/O can become a bottleneck with too many parallel runs
- Consider available memory: multiple AF2/Boltz runs require significant RAM per GPU
Important notes:
- Each combination uses an isolated cache in `<output_dir>/.nextflow_cache/`
- Combinations are truly independent: no shared state or locks
- Works seamlessly with the `--resume` flag for robust parallel execution
- Log output is captured per combination in the respective output directories
- Works with the `--quick-test` flag for rapid validation of large parameter sweeps
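The bounded-parallelism model can be sketched with a worker pool; `run_combination` here is a hypothetical stand-in for launching one Nextflow run, and only the cache-isolation layout is taken from this guide:

```python
from concurrent.futures import ThreadPoolExecutor

def run_combination(name, output_dir="results"):
    """Stand-in for one pipeline run; each combination gets its own
    cache directory so parallel runs never share state."""
    cache_dir = f"{output_dir}/.nextflow_cache/{name}"
    # ...a real driver would invoke the pipeline here via subprocess...
    return cache_dir

combos = [f"combination{i}" for i in range(1, 7)]

# --max-parallel 4: at most four combinations run concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    caches = list(pool.map(run_combination, combos))

print(len(set(caches)))  # 6 isolated cache directories
```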
BindSweeper generates organised output directories:
```
results/
├── combination1_param1_val1_param2_val2/
│   ├── results/
│   ├── best_designs/
│   └── logs/
├── combination2_param1_val3_param2_val4/
│   └── ...
├── sweep_results.csv    # Combined results
├── sweep_designs/       # Best designs from all runs
└── bindsweeper.log      # Execution log
```
BindSweeper automatically:
- Collects results from all parameter combinations
- Ranks designs based on metrics
- Copies best designs to a central directory
- Generates summary CSV files
- Optionally creates ZIP archives
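The collect-rank-export step can be sketched as follows; the CSV column names and the pLDDT ranking metric are assumptions for illustration, since the actual metrics come from the pipeline:

```python
import csv
import io

# Toy per-combination result CSVs (column names are hypothetical)
combo_csvs = {
    "combo1": "design,plddt\nd1,88.2\nd2,79.5\n",
    "combo2": "design,plddt\nd3,91.0\nd4,84.1\n",
}

# Collect rows from every combination, tagging their origin
rows = []
for combo, text in combo_csvs.items():
    for row in csv.DictReader(io.StringIO(text)):
        row["combination"] = combo
        rows.append(row)

# Rank all designs across combinations (higher pLDDT first)
rows.sort(key=lambda r: float(r["plddt"]), reverse=True)
best = [r["design"] for r in rows[:2]]
print(best)  # ['d3', 'd1']
```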
- Start with Quick Tests: Use `--quick-test` to validate configuration
- Use Dry Runs: Preview commands with `--dry-run` before execution
- Use Resume for Long Runs: Always use `--resume` for multi-hour sweeps to recover from interruptions
- Leverage Parallel Execution: Use `--parallel` for large sweeps on GPU-rich clusters to improve throughput
- Monitor Resources: Large parameter sweeps can be resource-intensive, especially when running in parallel
- Optimize `--max-parallel`: Set it to match the available GPUs (e.g., `--max-parallel 8` on systems with 8+ GPUs)
- Combine Resume and Parallel: Use `--resume --parallel` together for robust and efficient large-scale sweeps
- Organize Results: Use descriptive output directory names
- Check Dependencies: Ensure ProteinDJ and required tools are installed
- Preserve Cache Directories: Keep the `.nextflow/` and `work/` directories to enable resume functionality
- Monitor Parallel Runs: Check the cluster queue to ensure combinations are being scheduled appropriately
- Missing nextflow.config: Ensure the file exists in current or parent directory
- Invalid parameters: Check YAML syntax and parameter names
- Resource constraints: Monitor system resources during execution
- Path issues: Use absolute paths for input files
- Resume not working: Ensure the `.nextflow/cache/` and `work/` directories exist and haven't been cleaned
- Unexpected re-execution with resume: Nextflow detects input/parameter/script changes and correctly re-runs affected tasks
Enable debug logging to see detailed execution information:
```bash
bindsweeper --debug
```

For issues or questions:
- Check the log files in the output directory
- Use the `--debug` flag for detailed information
- Review configuration file syntax
- Ensure all dependencies are installed
Use a custom pipeline path:

```bash
bindsweeper --pipeline-path /custom/path/main.nf
```

Skip the sweep and only process existing results:

```bash
bindsweeper --skip-sweep --output-dir existing_results/
```