This directory contains tools for evaluating model performance, analyzing results, and generating comprehensive statistics for SimulCost-Bench.
```bash
# 1. Run evaluation for a specific model and task
python evaluation/heat_1d/eval.py -m gpt-5-2025-08-07 -d heat_1d -t cfl -l medium -z

# 2. Merge results from multiple datasets
python evaluation/stats_utils/merge_results.py

# 3. Analyze the merged data
python -c "import pandas as pd; df = pd.read_parquet('eval_results/merged_results.parquet'); print(df.head())"
```

Each dataset has its own evaluation script in the corresponding directory:
```bash
# Heat Transfer (1D)
python evaluation/heat_1d/eval.py -m <model> -d heat_1d -t <task> -l <precision> -z
```

Common Parameters:

- `-m, --model`: Model name/identifier
- `-d, --dataset`: Dataset name
- `-t, --task`: Task type (e.g., `cfl`, `n_space`, `beta`)
- `-l, --precision`: Precision level (`low`, `medium`, `high`) - only for certain datasets
- `-z, --zero-shot`: Enable zero-shot inference mode
Each evaluation run produces two types of outputs:

1. **Log files** (detailed results per question):
   - Location: `eval_results/{dataset}/{task}/{precision_level}/`
   - Contains: Full evaluation details for each question instance

2. **Parquet files** (aggregated dataframe):
   - Location: `eval_results/{dataset}/dataframes/`
   - Filename: `{inference_mode}_{model_name}.parquet`
   - Contains: Structured data ready for analysis
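Given the layout above, the parquet file for a particular run can be located by filling in the path template. A minimal sketch (the helper name `results_parquet_path` is hypothetical, not part of the codebase):

```python
import os

def results_parquet_path(dataset: str, inference_mode: str, model_name: str,
                         root: str = "eval_results") -> str:
    """Build the expected path of an aggregated parquet file.
    Hypothetical helper; the pattern follows the layout described above:
    {root}/{dataset}/dataframes/{inference_mode}_{model_name}.parquet"""
    return os.path.join(root, dataset, "dataframes",
                        f"{inference_mode}_{model_name}.parquet")

path = results_parquet_path("heat_1d", "zero_shot", "GPT-5")
# → eval_results/heat_1d/dataframes/zero_shot_GPT-5.parquet
```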
The parquet files contain the following columns:

| Column | Type | Description |
|---|---|---|
| `dataset` | str | Dataset name (e.g., `epoch_1d`, `euler_1d`, `diff_react_1d`) |
| `task` | str | Task type (e.g., `cfl`, `beta`, `resolution`) |
| `precision_level` | str | Precision level (`low`, `medium`, `high`) |
| `inference_mode` | str | Inference mode (`zero_shot`, `iterative`) |
| `model_name` | str | Model identifier |
| `qid` | int | Question ID |
| `profile` | str | Problem profile/configuration |
| `target_parameters` | str | Parameters being optimized |
| `non_target_parameters` | str | Fixed parameters |
| `is_converged` | bool | Whether the simulation converged |
| `is_successful` | bool | Whether the optimization succeeded |
| `model_cost` | float | Computational cost of the model's solution |
| `dummy_cost` | float | Computational cost of the baseline solution |
| `tolerance` | float | Error tolerance threshold (dataset-specific: some use `rmse_tolerance`, `energy_tolerance`, etc.) |
| `efficiency` | float | Cost efficiency ratio (`dummy_cost / model_cost`) |
| `attempt_history` | str (JSON) | Complete experimental trajectory with all tool calls |
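As a quick sanity check on this schema, the sketch below builds a toy frame with a few of the columns above and derives `efficiency` exactly as the table defines it (`dummy_cost / model_cost`); all data values are invented for illustration:

```python
import pandas as pd

# Toy rows using a subset of the documented schema; values are invented.
df = pd.DataFrame({
    "model_name": ["GPT-5", "GPT-5", "Qwen3-32B"],
    "is_successful": [True, False, True],
    "model_cost": [10.0, 25.0, 8.0],
    "dummy_cost": [20.0, 25.0, 4.0],
})

# efficiency = dummy_cost / model_cost (values > 1 mean the model beat the baseline)
df["efficiency"] = df["dummy_cost"] / df["model_cost"]

# Per-model success rate and mean efficiency
summary = df.groupby("model_name").agg(
    success_rate=("is_successful", "mean"),
    mean_efficiency=("efficiency", "mean"),
)
print(summary)
```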
**EPOCH 1D** (`epoch_1d`):
- `model_dt_multiplier`, `dummy_dt_multiplier`
- `model_nx`, `dummy_nx`
- `model_npart`, `dummy_npart`
- `model_field_order`, `dummy_field_order`
- `model_particle_order`, `dummy_particle_order`
- `rmse`: Root mean square error

**Euler 1D** (`euler_1d`):
- `model_cfl`, `dummy_cfl`
- `model_beta`, `dummy_beta`
- `model_k`, `dummy_k`
- `model_n_space`, `dummy_n_space`
- `rmse`: Root mean square error

**Navier-Stokes Transient 2D** (`ns_transient_2d`):
- `model_resolution`, `dummy_resolution`
- `model_cfl`, `dummy_cfl`
- `model_relaxation_factor`, `dummy_relaxation_factor`
- `model_residual_threshold`, `dummy_residual_threshold`
- `norm_rmse`: Normalized root mean square error

**Burgers 1D** (`burgers_1d`):
- `model_cfl`, `dummy_cfl`
- `model_beta`, `dummy_beta`
- `model_k`, `dummy_k`
- `model_n_space`, `dummy_n_space`
- `rmse`: Root mean square error

**Heat Transfer 1D** (`heat_1d`):
- `model_cfl`, `dummy_cfl`
- `model_n_space`, `dummy_n_space`
- `rmse`: Root mean square error

**Heat Transfer 2D** (`heat_2d`):
- `model_dx`, `dummy_dx`
- `model_relax`, `dummy_relax`
- `model_error_threshold`, `dummy_error_threshold`
- `model_t_init`, `dummy_t_init`
- `rmse`: Root mean square error

**MPM 2D** (`mpm_2d`):
- `model_nx`, `dummy_nx`
- `model_npart`, `dummy_npart`
- `model_cfl`, `dummy_cfl`
- `case`: Simulation case identifier (e.g., `cantilever`, `vibration_bar`, `disk_collision`)
- `avg_energy_diff`: Average energy difference
- `energy_tolerance`: Energy tolerance threshold
- `var_threshold`: Variance threshold

**Diffusion-Reaction 1D** (`diff_react_1d`):
- `model_n_space`, `dummy_n_space`
- `model_cfl`, `dummy_cfl`
- `model_tol`, `dummy_tol`
- `reaction_type`: Reaction term type (`fisher`, `allee`, `cubic`)
- `wave_error`: Relative wave position error
- `rmse_tolerance`: Task-specific RMSE tolerance (varies by task: `n_space`/`cfl`/`tol`)

**Euler 2D** (`euler_2d`):
- `model_n_grid_x`, `dummy_n_grid_x`
- `model_cfl`, `dummy_cfl`
- `model_cg_tolerance`, `dummy_cg_tolerance`
- `case`: Simulation case identifier (e.g., `central_explosion`, `sod_tube`, `lax_tube`, `mach_3`, `stair_flow`, `strong_tube`, `high_mach`, `interact_blast`, `rarefaction`)
- `rmse`: Root mean square error
- `rmse_tolerance`: RMSE tolerance threshold

**Hasegawa-Mima Linear** (`hasegawa_mima_linear`):
- `model_N`, `dummy_N`: Grid resolution parameter
- `model_dt`, `dummy_dt`: Time step size
- `model_cg_atol`, `dummy_cg_atol`: Conjugate gradient absolute tolerance
- `case`: Simulation case identifier (`monopole`, `dipole`, `sin_x_gauss_y`, `gauss_x_sin_y`)
- `wall_time_exceeded`: Boolean flag indicating if the wall time limit was exceeded
- `rmse`: Root mean square error
- `rmse_tolerance`: RMSE tolerance threshold

**Hasegawa-Mima Nonlinear** (`hasegawa_mima_nonlinear`):
- `model_N`, `dummy_N`: Grid resolution parameter
- `model_dt`, `dummy_dt`: Time step size
- `case`: Simulation case identifier (`monopole`, `dipole`, `sinusoidal`, `sin_x_gauss_y`, `gauss_x_sin_y`)
- `wall_time_exceeded`: Boolean flag indicating if the wall time limit was exceeded
- `rmse`: Root mean square error
- `rmse_tolerance`: RMSE tolerance threshold

**FEM 2D** (`fem_2d`):
- `model_dx`, `dummy_dx`: Spatial grid spacing
- `model_cfl`, `dummy_cfl`: CFL number for time stepping
- `case`: Simulation case identifier (`cantilever`, `vibration_bar`, `twisting_column`)
- `wall_time_exceeded`: Boolean flag indicating if the wall time limit was exceeded
- `avg_energy_diff`: Average energy difference between model and reference solutions
- `energy_tolerance`: Energy tolerance threshold
- `var_threshold`: Variance threshold for convergence
Note: Each dataset has different parameter columns based on the physics simulation. Missing columns are filled with `NaN` when merging datasets.
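The NaN filling falls out of how pandas aligns columns during concatenation; a minimal illustration with toy frames (the values are invented, not real results):

```python
import pandas as pd

# Two toy result frames with different dataset-specific columns
heat = pd.DataFrame({"dataset": ["heat_1d"], "model_cfl": [0.5]})
mpm = pd.DataFrame({"dataset": ["mpm_2d"], "model_npart": [1000]})

# pd.concat aligns on the union of columns; cells with no source column become NaN
merged = pd.concat([heat, mpm], ignore_index=True)
print(merged)
```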
The `merge_results.py` script automatically combines evaluation results from multiple datasets:

```bash
python evaluation/merge_results.py
```

What it does:

- Scans `eval_results/{dataset}/dataframes/` for all parquet files
- Combines files from standard and ICL variant datasets:
  - Standard datasets: `epoch_1d`, `euler_1d`, `euler_2d`, `ns_transient_2d`, `burgers_1d`, `heat_1d`, `heat_2d`, `mpm_2d`, `diff_react_1d`, `hasegawa_mima_linear`, `hasegawa_mima_nonlinear`
  - ICL variants: `euler_1d_icl_accuracy_focused`, `euler_1d_icl_cost_excluded`, `euler_1d_icl_full`, `heat_1d_icl_accuracy_focused`, `heat_1d_icl_cost_excluded`, `heat_1d_icl_full`, `ns_transient_2d_icl_accuracy_focused`, `ns_transient_2d_icl_cost_excluded`, `ns_transient_2d_icl_full`, `mpm_2d_icl_accuracy_focused`, `mpm_2d_icl_cost_excluded`, `mpm_2d_icl_full`
- Applies model name mapping to standardize model identifiers (e.g., `qwen3_8b` → `Qwen3-8B`)
- Validates that all model names are in the mapping dictionary (raises an error if unmapped models are found)
- Handles schema differences (fills missing columns with `NaN`)
- Adds/validates the `dataset` column for each record
- Outputs a unified file: `eval_results/merged_results.parquet`
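The combining step above can be sketched as follows. This is a simplified stand-in for the real script, operating on frames already loaded into memory (the actual script first discovers the parquet files by scanning `eval_results/{dataset}/dataframes/`); the function name `combine_results` is illustrative:

```python
import pandas as pd

def combine_results(frames_by_dataset: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Illustrative core of the merge: tag each frame with its dataset name,
    then concatenate. pandas aligns on the union of columns, so any column a
    frame lacks is filled with NaN (the documented schema-difference handling)."""
    frames = []
    for dataset, df in frames_by_dataset.items():
        df = df.copy()
        df["dataset"] = dataset  # add/validate the dataset column
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```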
The merge script automatically standardizes model names for consistency. The following mappings are applied:

| Original Model Name | Standardized Name |
|---|---|
| `qwen3_8b` | `Qwen3-8B` |
| `qwen3_32b` | `Qwen3-32B` |
| `qwen3_0_6b` | `Qwen3-0.6B` |
| `anthropic.claude-3-5-sonnet-20240620-v1:0` | `Claude-3.5-Sonnet` |
| `anthropic.claude-3-5-haiku-20241022-v1:0` | `Claude-3.5-Haiku` |
| `anthropic.claude-3-7-sonnet-20250219-v1:0` | `Claude-3.7-Sonnet` |
| `amazon.nova-premier-v1:0` | `Nova-Premier` |
| `mistral.mistral-large-2402-v1:0` | `Mistral-Large` |
| `meta.llama3-70b-instruct-v1:0` | `Llama-3-70B-Instruct` |
| `gpt-5-2025-08-07` | `GPT-5` |
| `gpt-5-2025-08-07-re-minimal` | `GPT-5-RE-Minimal` |
| `gpt-5-2025-08-07-re-high` | `GPT-5-RE-High` |
Important: All model names must be in the mapping dictionary. If an unmapped model name is encountered, the script raises a `ValueError` listing the unmapped names. To add new models, edit `MODEL_NAME_MAPPING` in `evaluation/merge_results.py`.
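The fail-loudly validation described above can be sketched like this. The dictionary here is an illustrative subset only, and `standardize_model_names` is a hypothetical name for the documented behavior:

```python
import pandas as pd

# Illustrative subset of the mapping; the real dictionary is
# MODEL_NAME_MAPPING in evaluation/merge_results.py.
MODEL_NAME_MAPPING = {
    "qwen3_8b": "Qwen3-8B",
    "gpt-5-2025-08-07": "GPT-5",
}

def standardize_model_names(df: pd.DataFrame) -> pd.DataFrame:
    """Map raw model identifiers to standardized names, raising ValueError
    on any identifier not covered by the mapping (sketch of the documented
    validation step)."""
    unmapped = set(df["model_name"]) - set(MODEL_NAME_MAPPING)
    if unmapped:
        raise ValueError(f"Unmapped model names: {sorted(unmapped)}")
    out = df.copy()
    out["model_name"] = out["model_name"].map(MODEL_NAME_MAPPING)
    return out
```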
```python
import pandas as pd

# Read merged results
df = pd.read_parquet('eval_results/merged_results.parquet')

# Filter by standard dataset
epoch_df = df[df['dataset'] == 'epoch_1d']

# Filter by ICL variant
icl_accuracy_df = df[df['dataset'].str.contains('icl_accuracy_focused')]
icl_cost_excluded_df = df[df['dataset'].str.contains('icl_cost_excluded')]
icl_full_df = df[df['dataset'].str.contains('icl_full')]

# Filter by model (using standardized names)
qwen_32b_df = df[df['model_name'] == 'Qwen3-32B']
claude_df = df[df['model_name'] == 'Claude-3.7-Sonnet']

# Filter by inference mode
zero_shot_df = df[df['inference_mode'] == 'zero_shot']
```
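Since `attempt_history` is stored as a JSON string, it must be decoded before any trajectory-level analysis. A small self-contained sketch; the inner structure (the `tool` and `cfl` keys) is invented for illustration and may not match the real trajectory schema:

```python
import json
import pandas as pd

# Toy frame mimicking the attempt_history column; the inner keys are invented.
df = pd.DataFrame({
    "qid": [1, 2],
    "attempt_history": [
        json.dumps([{"tool": "run_sim", "cfl": 0.9}, {"tool": "run_sim", "cfl": 0.5}]),
        json.dumps([{"tool": "run_sim", "cfl": 0.7}]),
    ],
})

# Decode the JSON column, then count tool calls per question
df["history"] = df["attempt_history"].apply(json.loads)
df["n_attempts"] = df["history"].apply(len)
print(df[["qid", "n_attempts"]])
```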