Conversation
rbarrue
left a comment
Hi Daohan, the code looks in great shape. I just left a few comments, nothing major.
ML/c2st/DNN.py
Outdated
```python
out = self.model(Xn, training=False)  # feed numpy directly; TF converts it with zero/minimal copies
return out.numpy()

def logits_tf(self, X_tf: tf.Tensor, training: bool) -> tf.Tensor:
```
I don't see this function used anywhere, can you confirm? Otherwise we can delete it.
We’re not using logits_tf anywhere in the current code, so it can be safely removed; I added it for debugging but forgot to remove it.
ML/c2st/dnn_training.py
Outdated
```python
batch_size = int(optim_cfg.get("batch_size", 4096))

# test2 toys
n_trials = int(extras.get("n_toys", 1000))
```
Normally toys are related to pseudo-experiments, when we take a small sample from a large weighted dataset to obtain a dataset with realistic statistical fluctuations.
Here, you mean the number of independent datasets obtained from shuffling the training data labels, right? If so, I would keep n_trials also in the configs.
Yes, exactly: here toys refer to independent datasets obtained by shuffling the training labels. I’ll update the configs to keep n_trials there.
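Once n_trials lives in the config, the training script could read it with a backward-compatible fallback to the old n_toys key. A minimal sketch (the extras dict and key names mirror the snippet above; the fallback chain is a suggestion, not the PR's implementation):

```python
# Hypothetical extras dict as parsed from a job's YAML config
extras = {"n_toys": 500}

# Prefer the new n_trials key, fall back to the legacy n_toys, then a default
n_trials = int(extras.get("n_trials", extras.get("n_toys", 1000)))
```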
```python
# ---------------- build C2ST dataset ----------------
def build_c2st_arrays(test_id: int, split: str, shard_limit=None):
    """
    User-required definitions:
```
The description of what each test is should also be in the C2ST config, at least once.
I'll add the descriptions to the config file.
ML/c2st/unbinned_2017.yaml
Outdated
```yaml
region: SR
extras:
  test_type: test3
  syst_job_id: pnn_TTLep_pow_2017_scales
```
Is there ever any difference between syst_job_id and pnn_job_id?
They’re the same; I’ll update the naming to make it consistent.
Pull request overview
Adds new tooling for (1) exporting BIT predictions/truth to CSV for calibration workflows and (2) running DNN-based C2ST studies driven by YAML configuration.
Changes:
- Added BIT evaluation script to materialize train/val splits and write prediction/truth CSVs.
- Added a DNN C2ST training pipeline (model class, training script) and a plotting helper.
- Added a large 2017 unbinned YAML config intended to define splitting + C2ST jobs, plus sbatch wrappers.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| ML/Calibration/bit_prediction_to_csv.sh | SLURM wrapper intended to run BIT→CSV calibration export. |
| ML/Calibration/bit_prediction_to_csv.py | BIT evaluation/export script that builds UID splits and writes calibration CSVs. |
| ML/c2st/DNN.py | Minimal TensorFlow DNN binary classifier with save/load helpers. |
| ML/c2st/dnn_training.py | YAML-driven C2ST training loop (test1/2/3) with optional PNN-based reweighting. |
| ML/c2st/plot.py | Standalone plotting script for AUC histogram visualization. |
| ML/c2st/unbinned_2017.yaml | New 2017 config with defaults + many jobs including C2ST entries. |
| ML/c2st/c2st_training.sh | SLURM wrapper intended to run a C2ST training job. |
```bash
echo "Job started at: $(date)"

python -u Calibrationx.py /users/daohan.wang/gluon-pdf-sbi3x/configs/unbinned_v5D/unbinned_delphes.yaml --job bit_NG_PDF4LHC21_6_tt2l_delphes --max-files 20 --write_pandas
```
The script invokes Calibrationx.py, but that file is not present in ML/Calibration/. This will fail at runtime; update the command to call the intended entrypoint (likely bit_prediction_to_csv.py or Calibration.py).
```bash
#SBATCH --job-name=retrain1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=40G
#SBATCH --time=0-01:00:00
#SBATCH --qos=short
#SBATCH --output=retrain1.stdout
```
This SBATCH script hard-codes an absolute config path (/users/.../gluon-pdf-sbi3x/...) and a specific job id. That makes the repo script non-portable; consider passing config/job via script args or using repo-relative paths under configs/.
```python
if args.max_files is not None:
    n = int(args.max_files)
    if n <= 0:
        raise RuntimeError("--max-files must be > 0")
    files = L.files[n:40]
    L = L.clone_from_files(files)
    tqdm.write(f"[DATA] Using only first {n} files (of original {len(getattr(getattr(samples_mod, loader_name), 'files', []))}):")
    tqdm.write(f"[DATA] First file: {files[0]}")
```
--max-files is documented as "first N ROOT files", but the slice uses L.files[n:40] (skips the first N and truncates at 40). Use L.files[:n] (and guard against n > len(L.files) / empty result) to match the flag’s semantics.
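A minimal sketch of the corrected slice, assuming L.files is a plain list of paths (first_n_files is a hypothetical helper illustrating the suggested fix, not part of the PR):

```python
# Keep the FIRST n files, matching the documented --max-files semantics,
# and guard against an empty result.
def first_n_files(files, n):
    if n <= 0:
        raise RuntimeError("--max-files must be > 0")
    selected = files[:n]  # first n files, not files[n:40]
    if not selected:
        raise RuntimeError("no files available after applying --max-files")
    return selected
```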
```python
# -------------------------- UID Splitting --------------------------
UID_CFG = (J.get("splitting") or {})
uid_enabled = bool(UID_CFG.get("enabled", False))
uid_fields = UID_CFG.get("uid_fields", ["run", "luminosityBlock", "event"])
uid_seed = int(UID_CFG.get("seed", 0))
uid_n_buckets = int(UID_CFG.get("n_buckets", 10000))
uid_scheme = (UID_CFG.get("scheme") or {})

uid_intervals = None
uid_splitter = None
train_interval = None
val_interval = None

if uid_enabled:
    uid_splitter = UIDSplitter(
        uid_fields=tuple(uid_fields),
        seed=uid_seed,
        n_buckets=uid_n_buckets,
    )

    # build bucket intervals EXACTLY like PNN
    keys = list(uid_scheme.keys())
    fracs = [float((uid_scheme[k] or {}).get("fraction", 0.0)) for k in keys]

    sizes = [int(math.floor(f * uid_n_buckets)) for f in fracs]
    sizes[-1] += uid_n_buckets - sum(sizes)

    uid_intervals = {}
    lo = 0
    for k, sz in zip(keys, sizes):
        uid_intervals[k] = (lo, lo + int(sz))
        lo += int(sz)
    calibration_train_key = "c2st_train"
    calibration_val_key = "c2st_val"
    train_interval = uid_intervals[calibration_train_key]
    val_interval = uid_intervals[calibration_val_key]

    print(f"[UID] enabled=True fields={uid_fields} seed={uid_seed} n_buckets={uid_n_buckets}")
    print(f"[UID] scheme intervals: {uid_intervals}")
    print(f"[UID] BIT train split '{calibration_train_key}' -> {train_interval}")
    print(f"[UID] BIT val split '{calibration_val_key}' -> {val_interval}")

# ---------------- PDF parametrization & combinations ----------------
pdf_n = J.get("pdf", {}).get("pdf_n", None)
pdf_type = J.get("pdf", {}).get("pdf_type", None)
pdf_basis = J.get("pdf", {}).get("pdf_basis", None)
pdf = PDFParametrization(n=pdf_n, typ=pdf_type, basis=pdf_basis)

combos = list(pdf.combinations)  # (), ('c0',), ..., ('ci','cj')
# Build base_points like the legacy script (order up to 2)
base_points = []
vars_ = pdf.variables
import itertools
for comb in itertools.combinations_with_replacement(vars_, 1):
    base_points.append({v: comb.count(v) for v in vars_})
for comb in itertools.combinations_with_replacement(vars_, 2):
    base_points.append({v: comb.count(v) for v in vars_})

# ---------------- collect all data (single pass) ----------------
def iterate_all(shard_limit=None):
    """
    Yields per-shard arrays PLUS (optionally) UID masks for train/val.
    - Always yields X,Q,x1,x2,id1,id2,w
    - If uid_enabled: also yields (m_tr, m_va) boolean masks with same length as X
    """
    n_shards = len(L)
    if args.small: n_shards = 1
    if shard_limit is not None: n_shards = min(n_shards, shard_limit)
    on2idx = {n: i for i, n in enumerate(obs_names)}
```
UID splitting is optional via splitting.enabled, but iterate_all() unconditionally builds uid_idx and uses train_interval/val_interval. If uid_enabled is false (or scheme keys missing), this will crash. Either enforce uid_enabled=True with a clear error, or mirror ML/BIT/pdf_bit_training.py by yielding an unsplit path when disabled.
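The second option could look like the following sketch, where a shard-level helper returns None when splitting is disabled so the caller can yield the unsplit path (shard_masks and its arguments are hypothetical names modeled on the script's uid_idx/train_interval/val_interval conventions):

```python
import numpy as np

def shard_masks(uid_idx, uid_enabled, train_interval=None, val_interval=None):
    """Return (train, val) boolean masks, or None when splitting is disabled."""
    if not uid_enabled:
        return None  # caller yields X, Q, ... without masks (unsplit path)
    # half-open bucket intervals [lo, hi), mirroring the interval construction above
    m_tr = (uid_idx >= train_interval[0]) & (uid_idx < train_interval[1])
    m_va = (uid_idx >= val_interval[0]) & (uid_idx < val_interval[1])
    return m_tr, m_va
```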
```python
# observers: must contain generator columns in this order
GEN_OBS = ["Generator_x1", "Generator_x2", "Generator_id1", "Generator_id2"]
obs_names = list(getattr(L, "observer_names", []) or [])
missing_gen = [n for n in GEN_OBS if n not in obs_names]
if missing_gen:
    raise RuntimeError(
        f"Observer_names must include {GEN_OBS}, missing {missing_gen} in loader '{loader_name}'."
    )
```
GEN_OBS validation omits Generator_scalePDF, but later iterate_all() requires it via on2idx["Generator_scalePDF"]. Add Generator_scalePDF to GEN_OBS (as done in ML/BIT/pdf_bit_training.py) so missing observers fail with a descriptive error instead of a KeyError.
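A sketch of the suggested fix, extending GEN_OBS and keeping the existing error-message shape (check_observers is a hypothetical wrapper around the inline validation in the diff):

```python
# Required generator columns, now including Generator_scalePDF so a missing
# observer fails with a descriptive RuntimeError instead of a later KeyError.
GEN_OBS = ["Generator_x1", "Generator_x2", "Generator_id1", "Generator_id2",
           "Generator_scalePDF"]

def check_observers(obs_names, loader_name):
    missing = [n for n in GEN_OBS if n not in obs_names]
    if missing:
        raise RuntimeError(
            f"Observer_names must include {GEN_OBS}, missing {missing} in loader '{loader_name}'."
        )
```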
```python
UID_CFG = (J.get("splitting") or {})
uid_enabled = bool(UID_CFG.get("enabled", False))
uid_fields = UID_CFG.get("uid_fields", ["run", "luminosityBlock", "event"])
uid_seed = int(UID_CFG.get("seed", 0))
uid_n_buckets = int(UID_CFG.get("n_buckets", 10000))
uid_scheme = (UID_CFG.get("scheme") or {})

uid_intervals = None
uid_splitter = None
if uid_enabled:
    uid_splitter = UIDSplitter(
        uid_fields=tuple(uid_fields),
        seed=uid_seed,
        n_buckets=uid_n_buckets,
    )

    keys = list(uid_scheme.keys())
    fracs = [float((uid_scheme[k] or {}).get("fraction", 0.0)) for k in keys]

    sizes = [int(math.floor(f * uid_n_buckets)) for f in fracs]
    sizes[-1] += uid_n_buckets - sum(sizes)

    uid_intervals = {}
    lo = 0
    for k, sz in zip(keys, sizes):
        uid_intervals[k] = (lo, lo + int(sz))
        lo += int(sz)

    c2st_train_key = "c2st_train"
    c2st_val_key = "c2st_val"
    train_interval = uid_intervals[c2st_train_key]
    val_interval = uid_intervals[c2st_val_key]

    print(f"[UID] enabled=True fields={uid_fields} seed={uid_seed} n_buckets={uid_n_buckets}")
    print(f"[UID] scheme intervals: {uid_intervals}")
    print(f"[UID] C2ST train split '{c2st_train_key}' -> {train_interval}")
    print(f"[UID] C2ST val split '{c2st_val_key}' -> {val_interval}")
else:
    raise RuntimeError("C2ST training requires splitting.enabled: true")
```
This script reads splitting config from J.get("splitting"), but the provided ML/c2st/unbinned_2017.yaml defines splitting only under defaults.splitting. Since common.yaml_loader only applies default splitting to pnn/bit jobs, dnn_c2st jobs won’t inherit it and this script will raise C2ST training requires splitting.enabled: true. Consider falling back to CFG["defaults"]["splitting"] when job-level splitting is absent, or require splitting: to be specified per dnn_c2st job in YAML.
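The fallback could be as small as one lookup chain; a sketch assuming CFG is the full parsed YAML and J the job dict, as in the script (resolve_splitting is a hypothetical helper, not the PR's code):

```python
def resolve_splitting(CFG, J):
    """Use job-level splitting if present, otherwise fall back to defaults.splitting."""
    split_cfg = J.get("splitting") or CFG.get("defaults", {}).get("splitting") or {}
    if not split_cfg.get("enabled", False):
        raise RuntimeError(
            "C2ST training requires splitting.enabled: true "
            "(either per job or under defaults.splitting)"
        )
    return split_cfg
```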
```python
    test3:
      - background/class0 = nominal dataset (X0, w0)
      - signal/class1 = REWEIGHTED systematic varied dataset (Xi, wi * exp(dAi @ vk))

    test2:
      - same (X,y,w) construction as test3
      - split fixed once (train/val/test fixed)
      - ONLY y_train is shuffled per toy (n_trials times)
      - y_val and y_test remain fixed (never shuffled)
    """
    if test_id in (2, 3):
        pnn = maybe_load_pnn_for_reweighting()
        if pnn is None:
            raise RuntimeError("test2/test3 require extras.use_pnn=true and valid extras.pnn_job_id.")
        VkA = pnn.VkA
        vk = VkA[var_idx]
    else:
        pnn = None
        vk = None

    X_list, y_list, w_list = [], [], []
    n0_total, n1_total = 0, 0

    for Xs, Ws in tqdm(iterate_epoch(split=split, shard_limit=shard_limit), desc="Materialize", unit="shard"):
        X0, w0 = Xs[nom_idx], Ws[nom_idx]
        Xi, wi = Xs[var_idx], Ws[var_idx]

        if len(X0) == 0 or len(Xi) == 0:
            continue

        def _drop_neg(X, w):
            m = (w > 0)
            return X[m], w[m]

        # class0: nominal background
        Xc0 = X0
        wc0 = w0

        # class1: varied signal (optionally reweighted)
        Xc1 = Xi
        if test_id == 1:
            wc1 = wi
        else:
            # IMPORTANT: reweight the VARIED sample weights
            dAi = pnn.deltaA(Xi)  # (Ni, C)
            wc1 = wi * np.exp(-(dAi @ vk))
```
The docstring says the reweighted varied weights are wi * exp(dAi @ vk), but the implementation uses wi * exp(-(dAi @ vk)). Please clarify which density-ratio direction is intended (var/nom vs nom/var) and make the docstring and code consistent to avoid sign mistakes in analyses.
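A tiny numpy illustration of why the sign matters: if dAi @ vk is the log density ratio log(p_var/p_nom), the two reweighting directions differ only by the sign in the exponent, and applying both in sequence recovers the original weights (all names here are illustrative, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
dA_v = rng.normal(size=(4, 3))        # hypothetical deltaA evaluated on varied events
vk = np.array([0.1, -0.2, 0.05])      # hypothetical variation coefficients
wi = np.ones(4)                       # original weights of the varied sample

w_to_nominal = wi * np.exp(-(dA_v @ vk))  # varied -> nominal (the code's sign)
w_to_varied  = wi * np.exp(+(dA_v @ vk))  # nominal -> varied (the docstring's sign)

# Consistency check: the two directions are exact inverses of each other.
assert np.allclose(w_to_nominal * np.exp(dA_v @ vk), wi)
```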
```yaml
splitting:
  enabled: true
  uid_fields: ["run", "luminosityBlock", "event"]
  seed: 42
  n_buckets: 10000

  scheme:
    pnn_train: { fraction: 0.50 }
    pnn_val: { fraction: 0.10 }
    c2st_train: { fraction: 0.15 }
    c2st_val: { fraction: 0.05 }
    final_eval: { fraction: 0.20 }
```
This YAML defines defaults.splitting, but dnn_training.py expects splitting under each dnn_c2st job (and common.yaml_loader does not apply default splitting to dnn_c2st). As-is, running the new C2ST jobs from this config will fail unless splitting is duplicated into each job or the training script is updated to read defaults.
```bash
echo "Job started at: $(date)"

python -u dnn_training.py configs/unbinned/unbinned_2017.yaml --job c2st_test1_TTLep_pow_2017_CMS_res_j_0_2017 --n_split 5 --train_seed 43
```
The sbatch script points at configs/unbinned/unbinned_2017.yaml, but that config file does not define the c2st_test1_TTLep_pow_2017_CMS_res_j_0_2017 job (and has no dnn_c2st jobs at all). Update the script to reference the intended YAML (e.g. ML/c2st/unbinned_2017.yaml) or add the jobs to the configs under configs/.
```python
plt.rcParams.update({
    "text.usetex": True,
    "font.family": "serif",
})
input_file = "2017.txt"
acc_values = []

with open(input_file, "r") as f:
    for line in f:
```
This plotting script hard-codes input_file = "2017.txt" and enables text.usetex=True, which will fail unless run from the right working directory with a LaTeX installation. Consider adding CLI args for input/output and making LaTeX rendering optional (default off) so the script is runnable in more environments.
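A sketch of the suggested CLI, with LaTeX rendering opt-in (argument names are suggestions, not the PR's interface):

```python
import argparse

def parse_args(argv=None):
    p = argparse.ArgumentParser(description="AUC histogram plot")
    p.add_argument("input", help="text file with one AUC value per line")
    p.add_argument("-o", "--output", default="auc_hist.pdf",
                   help="output figure path")
    p.add_argument("--usetex", action="store_true",
                   help="enable LaTeX text rendering (requires a TeX install)")
    return p.parse_args(argv)

# The script would then apply the flag instead of hard-coding usetex=True:
#   plt.rcParams.update({"text.usetex": args.usetex, "font.family": "serif"})
args = parse_args(["2017.txt"])
```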