
Add c2st and calibration #112

Open
Dorhand wants to merge 12 commits into sbi-pdf from add-c2st-calibration

Conversation


@Dorhand Dorhand commented Apr 2, 2026

No description provided.

@rbarrue rbarrue self-requested a review April 8, 2026 15:03

@rbarrue rbarrue left a comment


Hi Daohan, the code looks in great shape. I just left a few comments, nothing major.

ML/c2st/DNN.py Outdated
        out = self.model(Xn, training=False)  # feed numpy directly; TF handles it with zero or minimal copies
        return out.numpy()

    def logits_tf(self, X_tf: tf.Tensor, training: bool) -> tf.Tensor:

I don't see this function used anywhere, can you confirm? Otherwise we can delete it.

Collaborator Author


We’re not using logits_tf anywhere in the current code, so it can be safely removed; I added it for debugging and forgot to remove it.

batch_size = int(optim_cfg.get("batch_size", 4096))

# test2 toys
n_trials = int(extras.get("n_toys", 1000))

Normally, toys refer to pseudo-experiments, where we draw a small sample from a large weighted dataset to obtain a dataset with realistic statistical fluctuations.

Here, you mean the number of independent datasets obtained by shuffling the training-data labels, right? If so, I would also keep n_trials in the configs.

Collaborator Author


Yes, exactly: here, toys refer to independent datasets obtained by shuffling the training labels. I’ll update the configs to keep n_trials there.
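The label-shuffling "toy" procedure described above (test2: only the training labels are permuted per trial, and a statistic such as the validation AUC is recorded) can be sketched as follows; `permutation_auc_trials` and `train_fn` are hypothetical stand-ins for the real training-and-evaluation call, not names from this PR:

```python
import numpy as np

def permutation_auc_trials(y_train, train_fn, n_trials=1000, seed=0):
    """Run label-shuffling 'toys': each trial permutes only the training
    labels and records the statistic returned by train_fn (e.g. val AUC).
    Validation and test labels stay untouched, matching the test2 setup."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_trials):
        y_shuffled = rng.permutation(y_train)  # shuffle y_train only
        stats.append(train_fn(y_shuffled))
    return np.asarray(stats)
```

Under the null hypothesis (classifier cannot distinguish the classes), the distribution of these per-toy statistics gives the reference against which the unshuffled result is compared.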

# ---------------- build C2ST dataset ----------------
def build_c2st_arrays(test_id: int, split: str, shard_limit=None):
    """
    User-required definitions:

The description of what each test is should also be in the C2ST config, at least once.

Collaborator Author


I'll add the descriptions to the config file.

region: SR
extras:
  test_type: test3
  syst_job_id: pnn_TTLep_pow_2017_scales

Is there ever any difference between syst_job_id and pnn_job_id?

Collaborator Author


They’re the same; I’ll update the naming to make it consistent.


Copilot AI left a comment


Pull request overview

Adds new tooling for (1) exporting BIT predictions/truth to CSV for calibration workflows and (2) running DNN-based C2ST studies driven by YAML configuration.

Changes:

  • Added BIT evaluation script to materialize train/val splits and write prediction/truth CSVs.
  • Added a DNN C2ST training pipeline (model class, training script) and a plotting helper.
  • Added a large 2017 unbinned YAML config intended to define splitting + C2ST jobs, plus sbatch wrappers.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
ML/Calibration/bit_prediction_to_csv.sh SLURM wrapper intended to run BIT→CSV calibration export.
ML/Calibration/bit_prediction_to_csv.py BIT evaluation/export script that builds UID splits and writes calibration CSVs.
ML/c2st/DNN.py Minimal TensorFlow DNN binary classifier with save/load helpers.
ML/c2st/dnn_training.py YAML-driven C2ST training loop (test1/2/3) with optional PNN-based reweighting.
ML/c2st/plot.py Standalone plotting script for AUC histogram visualization.
ML/c2st/unbinned_2017.yaml New 2017 config with defaults + many jobs including C2ST entries.
ML/c2st/c2st_training.sh SLURM wrapper intended to run a C2ST training job.


Comment on lines +12 to +15
echo "Job started at: $(date)"

python -u Calibrationx.py /users/daohan.wang/gluon-pdf-sbi3x/configs/unbinned_v5D/unbinned_delphes.yaml --job bit_NG_PDF4LHC21_6_tt2l_delphes --max-files 20 --write_pandas


Copilot AI Apr 9, 2026


The script invokes Calibrationx.py, but that file is not present in ML/Calibration/. This will fail at runtime; update the command to call the intended entrypoint (likely bit_prediction_to_csv.py or Calibration.py).

Comment on lines +3 to +10
#SBATCH --job-name=retrain1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=40G
#SBATCH --time=0-01:00:00
#SBATCH --qos=short
#SBATCH --output=retrain1.stdout

Copilot AI Apr 9, 2026


This SBATCH script hard-codes an absolute config path (/users/.../gluon-pdf-sbi3x/...) and a specific job id. That makes the repo script non-portable; consider passing config/job via script args or using repo-relative paths under configs/.

Comment on lines +60 to +67
if args.max_files is not None:
    n = int(args.max_files)
    if n <= 0:
        raise RuntimeError("--max-files must be > 0")
    files = L.files[n:40]
    L = L.clone_from_files(files)
    tqdm.write(f"[DATA] Using only first {n} files (of original {len(getattr(getattr(samples_mod, loader_name), 'files', []))}):")
    tqdm.write(f"[DATA] First file: {files[0]}")

Copilot AI Apr 9, 2026


--max-files is documented as "first N ROOT files", but the slice uses L.files[n:40] (skips the first N and truncates at 40). Use L.files[:n] (and guard against n > len(L.files) / empty result) to match the flag’s semantics.

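A minimal sketch of the suggested fix, with a hypothetical helper name: select the first N files (as the flag documents) instead of `files[n:40]`, and guard the edge cases:

```python
def select_first_n_files(files, n):
    """Select the first n input files, matching the documented --max-files
    semantics (the original slice files[n:40] skips the first n instead)."""
    n = int(n)
    if n <= 0:
        raise RuntimeError("--max-files must be > 0")
    selected = files[:n]  # first n files; slicing is safe when n > len(files)
    if not selected:
        raise RuntimeError("no input files available")
    return selected
```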
Comment on lines +86 to +156
# -------------------------- UID Splitting --------------------------
UID_CFG = (J.get("splitting") or {})
uid_enabled = bool(UID_CFG.get("enabled", False))
uid_fields = UID_CFG.get("uid_fields", ["run", "luminosityBlock", "event"])
uid_seed = int(UID_CFG.get("seed", 0))
uid_n_buckets = int(UID_CFG.get("n_buckets", 10000))
uid_scheme = (UID_CFG.get("scheme") or {})

uid_intervals = None
uid_splitter = None
train_interval = None
val_interval = None

if uid_enabled:
    uid_splitter = UIDSplitter(
        uid_fields=tuple(uid_fields),
        seed=uid_seed,
        n_buckets=uid_n_buckets,
    )

    # build bucket intervals EXACTLY like PNN
    keys = list(uid_scheme.keys())
    fracs = [float((uid_scheme[k] or {}).get("fraction", 0.0)) for k in keys]

    sizes = [int(math.floor(f * uid_n_buckets)) for f in fracs]
    sizes[-1] += uid_n_buckets - sum(sizes)

    uid_intervals = {}
    lo = 0
    for k, sz in zip(keys, sizes):
        uid_intervals[k] = (lo, lo + int(sz))
        lo += int(sz)
    calibration_train_key = "c2st_train"
    calibration_val_key = "c2st_val"
    train_interval = uid_intervals[calibration_train_key]
    val_interval = uid_intervals[calibration_val_key]

    print(f"[UID] enabled=True fields={uid_fields} seed={uid_seed} n_buckets={uid_n_buckets}")
    print(f"[UID] scheme intervals: {uid_intervals}")
    print(f"[UID] BIT train split '{calibration_train_key}' -> {train_interval}")
    print(f"[UID] BIT val split '{calibration_val_key}' -> {val_interval}")

# ---------------- PDF parametrization & combinations ----------------
pdf_n = J.get("pdf", {}).get("pdf_n", None)
pdf_type = J.get("pdf", {}).get("pdf_type", None)
pdf_basis = J.get("pdf", {}).get("pdf_basis", None)
pdf = PDFParametrization(n=pdf_n, typ=pdf_type, basis=pdf_basis)

combos = list(pdf.combinations)  # (), ('c0',), ..., ('ci','cj')
# Build base_points like the legacy script (order up to 2)
base_points = []
vars_ = pdf.variables
import itertools
for comb in itertools.combinations_with_replacement(vars_, 1):
    base_points.append({v: comb.count(v) for v in vars_})
for comb in itertools.combinations_with_replacement(vars_, 2):
    base_points.append({v: comb.count(v) for v in vars_})


# ---------------- collect all data (single pass) ----------------
def iterate_all(shard_limit=None):
    """
    Yields per-shard arrays PLUS (optionally) UID masks for train/val.
    - Always yields X,Q,x1,x2,id1,id2,w
    - If uid_enabled: also yields (m_tr, m_va) boolean masks with same length as X
    """
    n_shards = len(L)
    if args.small: n_shards = 1
    if shard_limit is not None: n_shards = min(n_shards, shard_limit)
    on2idx = {n: i for i, n in enumerate(obs_names)}


Copilot AI Apr 9, 2026


UID splitting is optional via splitting.enabled, but iterate_all() unconditionally builds uid_idx and uses train_interval/val_interval. If uid_enabled is false (or scheme keys missing), this will crash. Either enforce uid_enabled=True with a clear error, or mirror ML/BIT/pdf_bit_training.py by yielding an unsplit path when disabled.

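One way to make the mask construction safe when UID splitting is disabled is to isolate it in a small helper that mirrors the "unsplit path" the comment mentions. This is a hedged sketch with hypothetical names (`split_masks`, `uid_idx`), not code from the PR:

```python
import numpy as np

def split_masks(uid_idx, train_interval, val_interval, enabled=True):
    """Boolean train/val masks from per-event UID bucket indices.

    When UID splitting is disabled, fall back to an 'unsplit' path
    (every event in the training mask) instead of crashing on the
    uninitialized train_interval/val_interval."""
    n = len(uid_idx)
    if not enabled:
        return np.ones(n, dtype=bool), np.zeros(n, dtype=bool)
    lo_t, hi_t = train_interval
    lo_v, hi_v = val_interval
    m_tr = (uid_idx >= lo_t) & (uid_idx < hi_t)
    m_va = (uid_idx >= lo_v) & (uid_idx < hi_v)
    return m_tr, m_va
```

Alternatively, raising a clear `RuntimeError` when `enabled` is false (as dnn_training.py does) is equally valid; either choice avoids the silent `None`-unpacking crash.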
Comment on lines +77 to +84
# observers: must contain generator columns in this order
GEN_OBS = ["Generator_x1", "Generator_x2", "Generator_id1", "Generator_id2"]
obs_names = list(getattr(L, "observer_names", []) or [])
missing_gen = [n for n in GEN_OBS if n not in obs_names]
if missing_gen:
    raise RuntimeError(
        f"Observer_names must include {GEN_OBS}, missing {missing_gen} in loader '{loader_name}'."
    )

Copilot AI Apr 9, 2026


GEN_OBS validation omits Generator_scalePDF, but later iterate_all() requires it via on2idx["Generator_scalePDF"]. Add Generator_scalePDF to GEN_OBS (as done in ML/BIT/pdf_bit_training.py) so missing observers fail with a descriptive error instead of a KeyError.

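A sketch of the suggested fix: extend the required-observer list with `Generator_scalePDF` so the validation fails with a descriptive message rather than a later `KeyError`. The helper name `check_observers` is hypothetical:

```python
# Include Generator_scalePDF in the required list so a missing observer
# fails up front with a descriptive error instead of a KeyError later.
GEN_OBS = ["Generator_x1", "Generator_x2", "Generator_id1", "Generator_id2",
           "Generator_scalePDF"]

def check_observers(obs_names, required=tuple(GEN_OBS)):
    """Raise a descriptive error listing every missing generator column."""
    missing = [n for n in required if n not in obs_names]
    if missing:
        raise RuntimeError(
            f"observer_names must include {list(required)}, missing {missing}"
        )
```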
Comment on lines +75 to +114
UID_CFG = (J.get("splitting") or {})
uid_enabled = bool(UID_CFG.get("enabled", False))
uid_fields = UID_CFG.get("uid_fields", ["run", "luminosityBlock", "event"])
uid_seed = int(UID_CFG.get("seed", 0))
uid_n_buckets = int(UID_CFG.get("n_buckets", 10000))
uid_scheme = (UID_CFG.get("scheme") or {})

uid_intervals = None
uid_splitter = None
if uid_enabled:
    uid_splitter = UIDSplitter(
        uid_fields=tuple(uid_fields),
        seed=uid_seed,
        n_buckets=uid_n_buckets,
    )

    keys = list(uid_scheme.keys())
    fracs = [float((uid_scheme[k] or {}).get("fraction", 0.0)) for k in keys]

    sizes = [int(math.floor(f * uid_n_buckets)) for f in fracs]
    sizes[-1] += uid_n_buckets - sum(sizes)

    uid_intervals = {}
    lo = 0
    for k, sz in zip(keys, sizes):
        uid_intervals[k] = (lo, lo + int(sz))
        lo += int(sz)

    c2st_train_key = "c2st_train"
    c2st_val_key = "c2st_val"
    train_interval = uid_intervals[c2st_train_key]
    val_interval = uid_intervals[c2st_val_key]

    print(f"[UID] enabled=True fields={uid_fields} seed={uid_seed} n_buckets={uid_n_buckets}")
    print(f"[UID] scheme intervals: {uid_intervals}")
    print(f"[UID] C2ST train split '{c2st_train_key}' -> {train_interval}")
    print(f"[UID] C2ST val split '{c2st_val_key}' -> {val_interval}")
else:
    raise RuntimeError("C2ST training requires splitting.enabled: true")


Copilot AI Apr 9, 2026


This script reads splitting config from J.get("splitting"), but the provided ML/c2st/unbinned_2017.yaml defines splitting only under defaults.splitting. Since common.yaml_loader only applies default splitting to pnn/bit jobs, dnn_c2st jobs won’t inherit it and this script will raise C2ST training requires splitting.enabled: true. Consider falling back to CFG["defaults"]["splitting"] when job-level splitting is absent, or require splitting: to be specified per dnn_c2st job in YAML.

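The first option the comment suggests, falling back to `defaults.splitting` when a job has no `splitting:` block, could look like this; `resolve_splitting_cfg` is a hypothetical helper name:

```python
def resolve_splitting_cfg(job_cfg, full_cfg):
    """Prefer a job-level `splitting:` block; otherwise fall back to
    defaults.splitting, so dnn_c2st jobs inherit the shared scheme
    without duplicating it into every job entry in the YAML."""
    split_cfg = job_cfg.get("splitting")
    if not split_cfg:
        split_cfg = (full_cfg.get("defaults") or {}).get("splitting") or {}
    return split_cfg
```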
Comment on lines +434 to +480
    test3:
      - background/class0 = nominal dataset (X0, w0)
      - signal/class1 = REWEIGHTED systematic varied dataset (Xi, wi * exp(dAi @ vk))

    test2:
      - same (X,y,w) construction as test3
      - split fixed once (train/val/test fixed)
      - ONLY y_train is shuffled per toy (n_trials times)
      - y_val and y_test remain fixed (never shuffled)
    """
    if test_id in (2, 3):
        pnn = maybe_load_pnn_for_reweighting()
        if pnn is None:
            raise RuntimeError("test2/test3 require extras.use_pnn=true and valid extras.pnn_job_id.")
        VkA = pnn.VkA
        vk = VkA[var_idx]
    else:
        pnn = None
        vk = None

    X_list, y_list, w_list = [], [], []
    n0_total, n1_total = 0, 0

    for Xs, Ws in tqdm(iterate_epoch(split=split, shard_limit=shard_limit), desc="Materialize", unit="shard"):
        X0, w0 = Xs[nom_idx], Ws[nom_idx]
        Xi, wi = Xs[var_idx], Ws[var_idx]

        if len(X0) == 0 or len(Xi) == 0:
            continue

        def _drop_neg(X, w):
            m = (w > 0)
            return X[m], w[m]

        # class0: nominal background
        Xc0 = X0
        wc0 = w0

        # class1: varied signal (optionally reweighted)
        Xc1 = Xi
        if test_id == 1:
            wc1 = wi
        else:
            # IMPORTANT: reweight the VARIED sample weights
            dAi = pnn.deltaA(Xi)  # (Ni, C)
            wc1 = wi * np.exp(-(dAi @ vk))


Copilot AI Apr 9, 2026


The docstring says the reweighted varied weights are wi * exp(dAi @ vk), but the implementation uses wi * exp(-(dAi @ vk)). Please clarify which density-ratio direction is intended (var/nom vs nom/var) and make the docstring and code consistent to avoid sign mistakes in analyses.

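A tiny numerical illustration of why the sign matters. Assuming (hypothetically) that the PNN's `dAi @ vk` approximates the per-event log-ratio log[p_var(x)/p_nom(x)], then `exp(+dAi @ vk)` scales weights in the var/nom direction and `exp(-(dAi @ vk))` in the nom/var direction; the two are exact reciprocals, so docstring and code picking opposite signs is a real discrepancy, not a cosmetic one:

```python
import numpy as np

dAi = np.array([[0.2], [-0.1]])  # per-event log-ratio contributions, shape (Ni, C)
vk = np.array([1.0])             # variation coefficients, shape (C,)
wi = np.array([1.0, 2.0])        # varied-sample event weights

w_plus = wi * np.exp(dAi @ vk)      # docstring direction: exp(+dAi @ vk)
w_minus = wi * np.exp(-(dAi @ vk))  # implemented direction: exp(-(dAi @ vk))
# The two reweightings are reciprocal scalings of the same weights.
```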
Comment on lines +10 to +22
splitting:
  enabled: true
  uid_fields: ["run", "luminosityBlock", "event"]
  seed: 42
  n_buckets: 10000

  scheme:
    pnn_train:  { fraction: 0.50 }
    pnn_val:    { fraction: 0.10 }
    c2st_train: { fraction: 0.15 }
    c2st_val:   { fraction: 0.05 }
    final_eval: { fraction: 0.20 }


Copilot AI Apr 9, 2026


This YAML defines defaults.splitting, but dnn_training.py expects splitting under each dnn_c2st job (and common.yaml_loader does not apply default splitting to dnn_c2st). As-is, running the new C2ST jobs from this config will fail unless splitting is duplicated into each job or the training script is updated to read defaults.

Comment on lines +12 to +15
echo "Job started at: $(date)"

python -u dnn_training.py configs/unbinned/unbinned_2017.yaml --job c2st_test1_TTLep_pow_2017_CMS_res_j_0_2017 --n_split 5 --train_seed 43


Copilot AI Apr 9, 2026


The sbatch script points at configs/unbinned/unbinned_2017.yaml, but that config file does not define the c2st_test1_TTLep_pow_2017_CMS_res_j_0_2017 job (and has no dnn_c2st jobs at all). Update the script to reference the intended YAML (e.g. ML/c2st/unbinned_2017.yaml) or add the jobs to the configs under configs/.

Comment on lines +7 to +15
plt.rcParams.update({
    "text.usetex": True,
    "font.family": "serif",
})
input_file = "2017.txt"
acc_values = []

with open(input_file, "r") as f:
    for line in f:

Copilot AI Apr 9, 2026


This plotting script hard-codes input_file = "2017.txt" and enables text.usetex=True, which will fail unless run from the right working directory with a LaTeX installation. Consider adding CLI args for input/output and making LaTeX rendering optional (default off) so the script is runnable in more environments.
