To create the conda environment, run the following commands:
conda create --name scigym python=3.10.16 -y
conda activate scigym
pip install -e .
# Required for graph edit distance metric
conda install --channel conda-forge pygraphviz
# Optional development tools
pip install pre-commit
pre-commit install

We host our full benchmark suite on HuggingFace and provide a script to download it. The benchmark dataset comes in two splits:
small: Consists of the 137 models we evaluated in our paper
large: Consists of an additional 213 models we did not evaluate
To download the splits, run the following commands:
python data/download.py --split small --save_dir <path_to_save_dir>
python data/download.py --split large --save_dir <path_to_save_dir>

You can either use one of our supported agents or set up your own. To set up your own agent, you need to implement the LLM interface (a sketch is given after the model list below). We provide implementations for Claude, Gemini, and GPT models in the scigym/agent folder, along with examples of how to use these models below. Supported model names include:
gemini-2.5-pro-preview-03-25
claude-3-5-haiku-20241022
claude-3-7-sonnet-20250219
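If you want to use a model we do not ship a wrapper for, a custom agent can follow the same pattern as the provided ones. The sketch below is illustrative only: the base class name LLM, the generate method, and its signature are assumptions, so check the implementations in scigym/agent for the actual interface.

from scigym.agent import LLM  # assumed base class name; see scigym/agent for the real interface

class MyCustomLLM(LLM):
    """Hypothetical wrapper around your own model API."""

    def __init__(self, model_name: str, system_prompt: str):
        # The provided wrappers are constructed with a model name and a
        # system prompt, so a custom agent can mirror that here.
        super().__init__(model_name=model_name, system_prompt=system_prompt)

    def generate(self, messages: list[dict]) -> str:
        # Send the system prompt plus the conversation history to your model
        # and return the assistant's reply as plain text.
        raise NotImplementedError("connect this to your model's API")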
To run the benchmark, you need two things: an agent and a configuration that specifies the required parameters for the input, output, and environment components of the run. We provide a default configuration file, configs/default.yml, which you can modify to suit your needs. The fields in this configuration file are detailed below.
See also scigym/examples for more examples.
from scigym.agent import Claude
from scigym.main import setup_controller
model_name = "claude-3-5-haiku-20241022"
controller = setup_controller("configs/default.yml", model_name)
system_prompt = controller._create_system_prompt()
llm = Claude(model_name=model_name, system_prompt=system_prompt)
controller.run_benchmark(model=llm)

The configuration fields are:

benchmark_dir: The directory where the benchmark instance folders are located. You will have these folders after downloading the benchmark dataset.
test_memorize: Whether to test the agent's ability to memorize the model in a one-shot setting.
eval_debug_rounds: Number of rounds the agent is allowed to re-submit its hypothesis SBML if the previous submission contained errors.
max_iterations: The maximum number of actions the agent can take in a single episode.
experiment_actions_path: Path to the system prompt explaining the experimental actions.
customized_functions_path: Path to the system prompt explaining the routines the agent can use.
output_dir: The directory where the output files will be saved.
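For reference, a configuration with these fields could be written out as in the sketch below and passed to setup_controller. All values and file names here are illustrative placeholders, not the shipped defaults in configs/default.yml.

import yaml

# Illustrative values only; adjust every path for your own setup.
config = {
    "benchmark_dir": "<path_to_save_dir>/small",
    "test_memorize": False,
    "eval_debug_rounds": 3,
    "max_iterations": 50,
    "experiment_actions_path": "configs/experiment_actions.txt",  # placeholder file name
    "customized_functions_path": "configs/customized_functions.txt",  # placeholder file name
    "output_dir": "outputs/",
}

# Write the dict out as a YAML file that setup_controller can load.
with open("configs/my_config.yml", "w") as f:
    yaml.safe_dump(config, f)

# controller = setup_controller("configs/my_config.yml", model_name)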