Conversation
rbarrue
left a comment
Hi Daohan, the code looks in great shape. I just left a few comments, nothing major.
ML/c2st/DNN.py
Outdated
```python
out = self.model(Xn, training=False)  # feed numpy directly; TF converts it with zero/minimal copies
return out.numpy()

def logits_tf(self, X_tf: tf.Tensor, training: bool) -> tf.Tensor:
```
I don't see this function used anywhere, can you confirm? Otherwise we can delete it.
We’re not using logits_tf anywhere in the current code, so it can be safely removed; I added it for debugging but forgot to remove it.
ML/c2st/dnn_training.py
Outdated
```python
batch_size = int(optim_cfg.get("batch_size", 4096))

# test2 toys
n_trials = int(extras.get("n_toys", 1000))
```
Normally toys are related to pseudo-experiments, when we take a small sample from a large weighted dataset to obtain a dataset with realistic statistical fluctuations.
Here, you mean the number of independent datasets obtained from shuffling the training data labels, right? If so, I would keep n_trials also in the configs.
Yes, exactly: here toys refer to independent datasets obtained by shuffling the training labels. I’ll update the configs to keep n_trials there.
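Once n_trials lives in the config, the training script could read it with a backward-compatible fallback to the old n_toys key. A minimal sketch (the extras dict and key names mirror the snippet above; the fallback chain is a suggestion, not the PR's implementation):

```python
# Hypothetical extras dict as parsed from a job's YAML config
extras = {"n_toys": 500}

# Prefer the new n_trials key, fall back to the legacy n_toys, then a default
n_trials = int(extras.get("n_trials", extras.get("n_toys", 1000)))
```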
```python
# ---------------- build C2ST dataset ----------------
def build_c2st_arrays(test_id: int, split: str, shard_limit=None):
    """
    User-required definitions:
```
The description of what each test is should also be in the C2ST config, at least once.
I'll add the descriptions to the config file.
ML/c2st/unbinned_2017.yaml
Outdated
```yaml
region: SR
extras:
  test_type: test3
  syst_job_id: pnn_TTLep_pow_2017_scales
```
Is there ever any difference between syst_job_id and pnn_job_id?
They’re the same; I’ll update the naming to make it consistent.
Pull request overview
Adds new tooling for (1) exporting BIT predictions/truth to CSV for calibration workflows and (2) running DNN-based C2ST studies driven by YAML configuration.
Changes:
- Added BIT evaluation script to materialize train/val splits and write prediction/truth CSVs.
- Added a DNN C2ST training pipeline (model class, training script) and a plotting helper.
- Added a large 2017 unbinned YAML config intended to define splitting + C2ST jobs, plus sbatch wrappers.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| ML/Calibration/bit_prediction_to_csv.sh | SLURM wrapper intended to run BIT→CSV calibration export. |
| ML/Calibration/bit_prediction_to_csv.py | BIT evaluation/export script that builds UID splits and writes calibration CSVs. |
| ML/c2st/DNN.py | Minimal TensorFlow DNN binary classifier with save/load helpers. |
| ML/c2st/dnn_training.py | YAML-driven C2ST training loop (test1/2/3) with optional PNN-based reweighting. |
| ML/c2st/plot.py | Standalone plotting script for AUC histogram visualization. |
| ML/c2st/unbinned_2017.yaml | New 2017 config with defaults + many jobs including C2ST entries. |
| ML/c2st/c2st_training.sh | SLURM wrapper intended to run a C2ST training job. |
```bash
echo "Job started at: $(date)"

python -u Calibrationx.py /users/daohan.wang/gluon-pdf-sbi3x/configs/unbinned_v5D/unbinned_delphes.yaml --job bit_NG_PDF4LHC21_6_tt2l_delphes --max-files 20 --write_pandas
```
The script invokes Calibrationx.py, but that file is not present in ML/Calibration/. This will fail at runtime; update the command to call the intended entrypoint (likely bit_prediction_to_csv.py or Calibration.py).
```bash
#SBATCH --job-name=retrain1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=40G
#SBATCH --time=0-01:00:00
#SBATCH --qos=short
#SBATCH --output=retrain1.stdout
```
This SBATCH script hard-codes an absolute config path (/users/.../gluon-pdf-sbi3x/...) and a specific job id. That makes the repo script non-portable; consider passing config/job via script args or using repo-relative paths under configs/.
```python
if args.max_files is not None:
    n = int(args.max_files)
    if n <= 0:
        raise RuntimeError("--max-files must be > 0")
    files = L.files[n:40]
    L = L.clone_from_files(files)
    tqdm.write(f"[DATA] Using only first {n} files (of original {len(getattr(getattr(samples_mod, loader_name), 'files', []))}):")
    tqdm.write(f"[DATA] First file: {files[0]}")
```
--max-files is documented as "first N ROOT files", but the slice uses L.files[n:40] (skips the first N and truncates at 40). Use L.files[:n] (and guard against n > len(L.files) / empty result) to match the flag’s semantics.
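A minimal sketch of the corrected slice, assuming L.files is a plain list of paths (first_n_files is a hypothetical helper illustrating the suggested fix, not part of the PR):

```python
# Keep the FIRST n files, matching the documented --max-files semantics,
# and guard against an empty result.
def first_n_files(files, n):
    if n <= 0:
        raise RuntimeError("--max-files must be > 0")
    selected = files[:n]  # first n files, not files[n:40]
    if not selected:
        raise RuntimeError("no files available after applying --max-files")
    return selected
```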
```python
# -------------------------- UID Splitting --------------------------
UID_CFG = (J.get("splitting") or {})
uid_enabled = bool(UID_CFG.get("enabled", False))
uid_fields = UID_CFG.get("uid_fields", ["run", "luminosityBlock", "event"])
uid_seed = int(UID_CFG.get("seed", 0))
uid_n_buckets = int(UID_CFG.get("n_buckets", 10000))
uid_scheme = (UID_CFG.get("scheme") or {})

uid_intervals = None
uid_splitter = None
train_interval = None
val_interval = None

if uid_enabled:
    uid_splitter = UIDSplitter(
        uid_fields=tuple(uid_fields),
        seed=uid_seed,
        n_buckets=uid_n_buckets,
    )

    # build bucket intervals EXACTLY like PNN
    keys = list(uid_scheme.keys())
    fracs = [float((uid_scheme[k] or {}).get("fraction", 0.0)) for k in keys]

    sizes = [int(math.floor(f * uid_n_buckets)) for f in fracs]
    sizes[-1] += uid_n_buckets - sum(sizes)

    uid_intervals = {}
    lo = 0
    for k, sz in zip(keys, sizes):
        uid_intervals[k] = (lo, lo + int(sz))
        lo += int(sz)
    calibration_train_key = "c2st_train"
    calibration_val_key = "c2st_val"
    train_interval = uid_intervals[calibration_train_key]
    val_interval = uid_intervals[calibration_val_key]

    print(f"[UID] enabled=True fields={uid_fields} seed={uid_seed} n_buckets={uid_n_buckets}")
    print(f"[UID] scheme intervals: {uid_intervals}")
    print(f"[UID] BIT train split '{calibration_train_key}' -> {train_interval}")
    print(f"[UID] BIT val split '{calibration_val_key}' -> {val_interval}")

# ---------------- PDF parametrization & combinations ----------------
pdf_n = J.get("pdf", {}).get("pdf_n", None)
pdf_type = J.get("pdf", {}).get("pdf_type", None)
pdf_basis = J.get("pdf", {}).get("pdf_basis", None)
pdf = PDFParametrization(n=pdf_n, typ=pdf_type, basis=pdf_basis)

combos = list(pdf.combinations)  # (), ('c0',), ..., ('ci','cj')
# Build base_points like the legacy script (order up to 2)
base_points = []
vars_ = pdf.variables
import itertools
for comb in itertools.combinations_with_replacement(vars_, 1):
    base_points.append({v: comb.count(v) for v in vars_})
for comb in itertools.combinations_with_replacement(vars_, 2):
    base_points.append({v: comb.count(v) for v in vars_})

# ---------------- collect all data (single pass) ----------------
def iterate_all(shard_limit=None):
    """
    Yields per-shard arrays PLUS (optionally) UID masks for train/val.
    - Always yields X,Q,x1,x2,id1,id2,w
    - If uid_enabled: also yields (m_tr, m_va) boolean masks with same length as X
    """
    n_shards = len(L)
    if args.small: n_shards = 1
    if shard_limit is not None: n_shards = min(n_shards, shard_limit)
    on2idx = {n: i for i, n in enumerate(obs_names)}
```
UID splitting is optional via splitting.enabled, but iterate_all() unconditionally builds uid_idx and uses train_interval/val_interval. If uid_enabled is false (or scheme keys missing), this will crash. Either enforce uid_enabled=True with a clear error, or mirror ML/BIT/pdf_bit_training.py by yielding an unsplit path when disabled.
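The second option could look like the following sketch, where a shard-level helper returns None when splitting is disabled so the caller can yield the unsplit path (shard_masks and its arguments are hypothetical names modeled on the script's uid_idx/train_interval/val_interval conventions):

```python
import numpy as np

def shard_masks(uid_idx, uid_enabled, train_interval=None, val_interval=None):
    """Return (train, val) boolean masks, or None when splitting is disabled."""
    if not uid_enabled:
        return None  # caller yields X, Q, ... without masks (unsplit path)
    # half-open bucket intervals [lo, hi), mirroring the interval construction above
    m_tr = (uid_idx >= train_interval[0]) & (uid_idx < train_interval[1])
    m_va = (uid_idx >= val_interval[0]) & (uid_idx < val_interval[1])
    return m_tr, m_va
```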
```python
# observers: must contain generator columns in this order
GEN_OBS = ["Generator_x1", "Generator_x2", "Generator_id1", "Generator_id2"]
obs_names = list(getattr(L, "observer_names", []) or [])
missing_gen = [n for n in GEN_OBS if n not in obs_names]
if missing_gen:
    raise RuntimeError(
        f"Observer_names must include {GEN_OBS}, missing {missing_gen} in loader '{loader_name}'."
    )
```
GEN_OBS validation omits Generator_scalePDF, but later iterate_all() requires it via on2idx["Generator_scalePDF"]. Add Generator_scalePDF to GEN_OBS (as done in ML/BIT/pdf_bit_training.py) so missing observers fail with a descriptive error instead of a KeyError.
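A sketch of the suggested fix, extending GEN_OBS and keeping the existing error-message shape (check_observers is a hypothetical wrapper around the inline validation in the diff):

```python
# Required generator columns, now including Generator_scalePDF so a missing
# observer fails with a descriptive RuntimeError instead of a later KeyError.
GEN_OBS = ["Generator_x1", "Generator_x2", "Generator_id1", "Generator_id2",
           "Generator_scalePDF"]

def check_observers(obs_names, loader_name):
    missing = [n for n in GEN_OBS if n not in obs_names]
    if missing:
        raise RuntimeError(
            f"Observer_names must include {GEN_OBS}, missing {missing} in loader '{loader_name}'."
        )
```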
```python
UID_CFG = (J.get("splitting") or {})
uid_enabled = bool(UID_CFG.get("enabled", False))
uid_fields = UID_CFG.get("uid_fields", ["run", "luminosityBlock", "event"])
uid_seed = int(UID_CFG.get("seed", 0))
uid_n_buckets = int(UID_CFG.get("n_buckets", 10000))
uid_scheme = (UID_CFG.get("scheme") or {})

uid_intervals = None
uid_splitter = None
if uid_enabled:
    uid_splitter = UIDSplitter(
        uid_fields=tuple(uid_fields),
        seed=uid_seed,
        n_buckets=uid_n_buckets,
    )

    keys = list(uid_scheme.keys())
    fracs = [float((uid_scheme[k] or {}).get("fraction", 0.0)) for k in keys]

    sizes = [int(math.floor(f * uid_n_buckets)) for f in fracs]
    sizes[-1] += uid_n_buckets - sum(sizes)

    uid_intervals = {}
    lo = 0
    for k, sz in zip(keys, sizes):
        uid_intervals[k] = (lo, lo + int(sz))
        lo += int(sz)

    c2st_train_key = "c2st_train"
    c2st_val_key = "c2st_val"
    train_interval = uid_intervals[c2st_train_key]
    val_interval = uid_intervals[c2st_val_key]

    print(f"[UID] enabled=True fields={uid_fields} seed={uid_seed} n_buckets={uid_n_buckets}")
    print(f"[UID] scheme intervals: {uid_intervals}")
    print(f"[UID] C2ST train split '{c2st_train_key}' -> {train_interval}")
    print(f"[UID] C2ST val split '{c2st_val_key}' -> {val_interval}")
else:
    raise RuntimeError("C2ST training requires splitting.enabled: true")
```
This script reads splitting config from J.get("splitting"), but the provided ML/c2st/unbinned_2017.yaml defines splitting only under defaults.splitting. Since common.yaml_loader only applies default splitting to pnn/bit jobs, dnn_c2st jobs won’t inherit it and this script will raise C2ST training requires splitting.enabled: true. Consider falling back to CFG["defaults"]["splitting"] when job-level splitting is absent, or require splitting: to be specified per dnn_c2st job in YAML.
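The fallback could be as small as one lookup chain; a sketch assuming CFG is the full parsed YAML and J the job dict, as in the script (resolve_splitting is a hypothetical helper, not the PR's code):

```python
def resolve_splitting(CFG, J):
    """Use job-level splitting if present, otherwise fall back to defaults.splitting."""
    split_cfg = J.get("splitting") or CFG.get("defaults", {}).get("splitting") or {}
    if not split_cfg.get("enabled", False):
        raise RuntimeError(
            "C2ST training requires splitting.enabled: true "
            "(either per job or under defaults.splitting)"
        )
    return split_cfg
```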
```python
    test3:
      - background/class0 = nominal dataset (X0, w0)
      - signal/class1 = REWEIGHTED systematic varied dataset (Xi, wi * exp(dAi @ vk))

    test2:
      - same (X,y,w) construction as test3
      - split fixed once (train/val/test fixed)
      - ONLY y_train is shuffled per toy (n_trials times)
      - y_val and y_test remain fixed (never shuffled)
    """
    if test_id in (2, 3):
        pnn = maybe_load_pnn_for_reweighting()
        if pnn is None:
            raise RuntimeError("test2/test3 require extras.use_pnn=true and valid extras.pnn_job_id.")
        VkA = pnn.VkA
        vk = VkA[var_idx]
    else:
        pnn = None
        vk = None

    X_list, y_list, w_list = [], [], []
    n0_total, n1_total = 0, 0

    for Xs, Ws in tqdm(iterate_epoch(split=split, shard_limit=shard_limit), desc="Materialize", unit="shard"):
        X0, w0 = Xs[nom_idx], Ws[nom_idx]
        Xi, wi = Xs[var_idx], Ws[var_idx]

        if len(X0) == 0 or len(Xi) == 0:
            continue

        def _drop_neg(X, w):
            m = (w > 0)
            return X[m], w[m]

        # class0: nominal background
        Xc0 = X0
        wc0 = w0

        # class1: varied signal (optionally reweighted)
        Xc1 = Xi
        if test_id == 1:
            wc1 = wi
        else:
            # IMPORTANT: reweight the VARIED sample weights
            dAi = pnn.deltaA(Xi)  # (Ni, C)
            wc1 = wi * np.exp(-(dAi @ vk))
```
The docstring says the reweighted varied weights are wi * exp(dAi @ vk), but the implementation uses wi * exp(-(dAi @ vk)). Please clarify which density-ratio direction is intended (var/nom vs nom/var) and make the docstring and code consistent to avoid sign mistakes in analyses.
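A tiny numpy illustration of why the sign matters: if dAi @ vk is the log density ratio log(p_var/p_nom), the two reweighting directions differ only by the sign in the exponent, and applying both in sequence recovers the original weights (all names here are illustrative, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
dA_v = rng.normal(size=(4, 3))        # hypothetical deltaA evaluated on varied events
vk = np.array([0.1, -0.2, 0.05])      # hypothetical variation coefficients
wi = np.ones(4)                       # original weights of the varied sample

w_to_nominal = wi * np.exp(-(dA_v @ vk))  # varied -> nominal (the code's sign)
w_to_varied  = wi * np.exp(+(dA_v @ vk))  # nominal -> varied (the docstring's sign)

# Consistency check: the two directions are exact inverses of each other.
assert np.allclose(w_to_nominal * np.exp(dA_v @ vk), wi)
```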
```yaml
splitting:
  enabled: true
  uid_fields: ["run", "luminosityBlock", "event"]
  seed: 42
  n_buckets: 10000

  scheme:
    pnn_train: { fraction: 0.50 }
    pnn_val: { fraction: 0.10 }
    c2st_train: { fraction: 0.15 }
    c2st_val: { fraction: 0.05 }
    final_eval: { fraction: 0.20 }
```
This YAML defines defaults.splitting, but dnn_training.py expects splitting under each dnn_c2st job (and common.yaml_loader does not apply default splitting to dnn_c2st). As-is, running the new C2ST jobs from this config will fail unless splitting is duplicated into each job or the training script is updated to read defaults.
```bash
echo "Job started at: $(date)"

python -u dnn_training.py configs/unbinned/unbinned_2017.yaml --job c2st_test1_TTLep_pow_2017_CMS_res_j_0_2017 --n_split 5 --train_seed 43
```
The sbatch script points at configs/unbinned/unbinned_2017.yaml, but that config file does not define the c2st_test1_TTLep_pow_2017_CMS_res_j_0_2017 job (and has no dnn_c2st jobs at all). Update the script to reference the intended YAML (e.g. ML/c2st/unbinned_2017.yaml) or add the jobs to the configs under configs/.
```python
plt.rcParams.update({
    "text.usetex": True,
    "font.family": "serif",
})
input_file = "2017.txt"
acc_values = []

with open(input_file, "r") as f:
    for line in f:
```
This plotting script hard-codes input_file = "2017.txt" and enables text.usetex=True, which will fail unless run from the right working directory with a LaTeX installation. Consider adding CLI args for input/output and making LaTeX rendering optional (default off) so the script is runnable in more environments.
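A sketch of the suggested CLI, with LaTeX rendering opt-in (argument names are suggestions, not the PR's interface):

```python
import argparse

def parse_args(argv=None):
    p = argparse.ArgumentParser(description="AUC histogram plot")
    p.add_argument("input", help="text file with one AUC value per line")
    p.add_argument("-o", "--output", default="auc_hist.pdf",
                   help="output figure path")
    p.add_argument("--usetex", action="store_true",
                   help="enable LaTeX text rendering (requires a TeX install)")
    return p.parse_args(argv)

# The script would then apply the flag instead of hard-coding usetex=True:
#   plt.rcParams.update({"text.usetex": args.usetex, "font.family": "serif"})
args = parse_args(["2017.txt"])
```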