Dorhand/SAGE
Higgs ML Uncertainty Challenge -- HEPHY-ML

Hardware

Model training and evaluation were conducted using a single NVIDIA GeForce RTX 3090 GPU (24GB VRAM).

Training Dataset Generation

This section describes how to generate the training datasets used in this project. All scripts related to dataset generation are in the dataset_generation/ folder.

To create the training set, you will need the following four scripts:

  • Higgs_Datasets_Train.py
  • Higgs_Datasets_Train_Generation.py
  • derived_quantities.py
  • systematics.py

Usage: Simply run:

python3 Higgs_Datasets_Train_Generation.py

The following parameters control generation:

  • input_directory specifies the path to the dataset (downloaded from the Higgs Uncertainty Challenge).
  • Six parameters (tes, jes, soft_met, ttbar_scale, diboson_scale, bkg_scale) define the systematic uncertainties. Modifying these values generates different systematic variants of the training dataset.
  • hdf5_filename is the path and filename of the output .h5 file.

The resulting dataset is an .h5 file with shape (N, 30):

  • The first 16 columns are primary features. The next 12 columns are derived features (see arXiv:2410.02867 for details).
  • The 29th column is the event weight.
  • The 30th column is the label, where:
    • 0 = htautau
    • 1 = ztautau
    • 2 = ttbar
    • 3 = diboson

In our study, only one nominal dataset is required, corresponding to: $\alpha_\text{tes}$ = 1.0, $\alpha_\text{jes}$ = 1.0, $\alpha_\text{met}$ = 0.0, $\alpha_\text{ttbar}$ = 1.0, $\alpha_\text{bkg}$ = 1.0, $\alpha_\text{VV}$ = 1.0
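For reference, the nominal configuration can be written out as a plain dictionary. This is an illustrative sketch (the key names follow the parameters listed above), not the actual interface of Higgs_Datasets_Train_Generation.py:

```python
# Nominal systematic configuration: no shifts applied.
# Key names mirror the script parameters; this dict is illustrative only.
NOMINAL = {
    "tes": 1.0,            # tau energy scale
    "jes": 1.0,            # jet energy scale
    "soft_met": 0.0,       # soft MET term
    "ttbar_scale": 1.0,    # ttbar normalization
    "diboson_scale": 1.0,  # diboson normalization
    "bkg_scale": 1.0,      # overall background normalization
}
```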

After generating the nominal .h5 file, run:

python3 process.py

This converts the .h5 file into a .pt file with shape (N, 18), containing:

  • The first 16 primary features
  • The event weight
  • The label

This processed .pt file is the standard input for model training.
Systematic injection and derived feature computation are both performed dynamically during training.
Alternatively, users may use the provided scripts to pre-generate multiple full training sets with shape (N, 30) to explore different training strategies.
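The column layout of the processed file can be sketched as follows, using a NumPy array as a stand-in for the .pt tensor (the slicing, not the loading code, is the point):

```python
import numpy as np

# Stand-in for the processed (N, 18) training tensor.
data = np.zeros((5, 18))

features = data[:, :16]  # columns 0-15: primary features
weights = data[:, 16]    # column 16: event weight
labels = data[:, 17]     # column 17: class label (0-3)
```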

To reduce GPU memory usage, the .pt training file is split into 10 chunks. At the beginning of training, one chunk is randomly selected as the validation set, while the remaining 9 chunks are used for training. Each training chunk is loaded and processed sequentially to fit within memory constraints.
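The chunking scheme can be sketched as below (a minimal NumPy stand-in; the actual split is performed on the .pt tensor):

```python
import numpy as np

rng = np.random.default_rng(0)
events = np.arange(1000).reshape(-1, 1)  # stand-in for the (N, 18) training tensor

chunks = np.array_split(events, 10)      # split into 10 roughly equal chunks
val_idx = rng.integers(10)               # pick one chunk at random for validation
val_set = chunks[val_idx]
train_chunks = [c for i, c in enumerate(chunks) if i != val_idx]
# Training then iterates over train_chunks sequentially to bound memory usage.
```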

Model Training

Model training is performed using three core scripts:

  • GNN.py: The main training script that defines the model, data loading, training loop, and evaluation.
  • derived_quantities.py: Computes physics-inspired derived features dynamically during training.
  • systematics.py: Applies systematic variations (e.g., TES, JES, MET) on the fly for each batch.

The latter two modules are designed for efficient and parallelized computation, enabling dynamic injection of systematics and derived feature generation without precomputing large datasets. All scripts are contained in the train folder.

To start training, simply run:

python3 GNN.py

On a single NVIDIA RTX 3090 GPU, the training pipeline completes 12 epochs within 2 days.
Each epoch dynamically samples and trains on 100 distinct systematic configurations, ensuring the model is robust to systematic variations.
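The per-epoch sampling can be sketched as below. The sampling ranges shown are assumptions borrowed from the interpolation grid described later; the actual training ranges are defined in GNN.py:

```python
import random

def sample_systematics(rng):
    # Hypothetical sampling ranges; the real ones live in GNN.py.
    return {
        "tes": rng.uniform(0.96, 1.04),
        "jes": rng.uniform(0.96, 1.04),
        "soft_met": rng.uniform(0.0, 5.0),
    }

rng = random.Random(42)
configs = [sample_systematics(rng) for _ in range(100)]  # 100 configurations per epoch
```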

For further architectural and algorithmic details, please refer to our SAGE paper.

Interpolator Construction

All files related to interpolator construction are located in the interpolation/ folder. There are six key files:

  • global_stats.pth: Stores the global mean and standard deviation of each feature, computed from the training set. Used for feature normalization.
  • model.pth: Contains the trained model parameters. This file can be directly loaded for inference.
  • derived_quantities.py: Dynamically computes derived features from primary inputs.
  • systematics.py: Applies systematic variations (e.g., TES, JES, soft MET) to the input data.
  • test.py: The core script for inference. It takes specific values of tes, jes, and soft_met as input, and produces the signal-class probability histograms under the trained classifier.

When test.py is executed, it:

  1. Loads the trained model and normalization statistics.
  2. Applies the specified nuisance parameters to all 10 data chunks (9 training + 1 validation).
  3. Computes the weighted classifier output probabilities for the signal class.
  4. Produces four histograms: one for the full region, and one for each of the three control regions. All histograms are saved as .pkl files for later use in interpolation.
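Step 3 amounts to a weighted histogram of the classifier output, which can be sketched as follows (the bin count here is an assumption, not the scripts' actual setting):

```python
import numpy as np

rng = np.random.default_rng(1)
probs = rng.random(10_000)    # stand-in signal-class probabilities in [0, 1)
weights = rng.random(10_000)  # stand-in per-event weights

# Weighted classifier-output histogram; 50 bins is an assumed choice.
counts, edges = np.histogram(probs, bins=50, range=(0.0, 1.0), weights=weights)
```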

We repeat this process for a grid of 11,849 fixed nuisance parameter settings, uniformly spanning:

  • $\alpha_\text{tes}$ ∈ [0.96, 1.04]
  • $\alpha_\text{jes}$ ∈ [0.96, 1.04]
  • $\alpha_\text{met}$ ∈ [0, 5]

These histograms serve as the basis for interpolator construction.
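One factorization of 11,849 into per-axis counts is 17 × 17 × 41; the per-axis counts are not stated here, so treat them as an assumption in this sketch of the grid construction:

```python
import numpy as np

# Assumed per-axis point counts: 17 * 17 * 41 = 11,849 grid points.
tes = np.linspace(0.96, 1.04, 17)
jes = np.linspace(0.96, 1.04, 17)
soft_met = np.linspace(0.0, 5.0, 41)

grid = np.stack(np.meshgrid(tes, jes, soft_met, indexing="ij"), axis=-1).reshape(-1, 3)
```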

  • interpolation_CR.py: This script performs 3D interpolation over the nuisance parameter space for each region’s histogram, resulting in an interpolator with 11,849 grid points per region.
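A minimal version of such a 3D interpolator can be built with SciPy's RegularGridInterpolator, used here as an assumed stand-in for whatever interpolation_CR.py actually uses:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Assumed per-axis counts (17 * 17 * 41 = 11,849 grid points).
tes = np.linspace(0.96, 1.04, 17)
jes = np.linspace(0.96, 1.04, 17)
soft_met = np.linspace(0.0, 5.0, 41)

# One value per grid point, e.g. the content of a single histogram bin.
values = np.random.default_rng(2).random((17, 17, 41))

interp = RegularGridInterpolator((tes, jes, soft_met), values)
off_grid = interp([[1.0, 1.0, 2.5]])  # evaluate at an off-grid nuisance point
```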

For detailed construction and interpolation methodology, please refer to the SAGE paper.

Pseudo Dataset Generation

All scripts related to pseudo dataset generation are located in the toy_generation/ folder. The pipeline consists of 8 files in total, but generation can be performed by simply running:

python3 example_fix.py

or

python3 example_random.py

  • example_fix.py: Generates pseudo datasets with a fixed signal strength value.
  • example_random.py: Generates pseudo datasets with random signal strength values.

Within each script, users can configure:

  • Whether each of the 6 nuisance parameters (tes, jes, soft_met, ttbar_scale, diboson_scale, bkg_scale) is fixed or randomly sampled.
  • The total number of pseudo datasets to be generated.

Each generated dataset is saved as an .h5 file with shape (N, 18), containing:

  • The first 16 columns: primary features
  • The 17th column: event weight
  • The 18th column: label

In addition, a .pkl file is generated, which contains the true values of the signal strength and the six nuisance parameters.
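The truth record can be sketched as a small pickled dictionary. The field names here are illustrative, not necessarily the scripts' exact keys, and an in-memory buffer stands in for the .pkl file:

```python
import io
import pickle
import random

rng = random.Random(0)

# Hypothetical truth record for one pseudo dataset.
truth = {
    "mu": rng.uniform(0.0, 3.0),     # signal strength (random in example_random.py)
    "tes": 1.0,                      # each nuisance parameter can be fixed ...
    "jes": rng.uniform(0.96, 1.04),  # ... or randomly sampled
    "soft_met": 0.0,
    "ttbar_scale": 1.0,
    "diboson_scale": 1.0,
    "bkg_scale": 1.0,
}

buf = io.BytesIO()  # the scripts write a real .pkl file instead
pickle.dump(truth, buf)
```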

To convert the .h5 file into a .pt format suitable for fast loading in PyTorch, run:

python3 h5pt.py

This conversion strips the dataset down to only the first 16 primary features, resulting in a .pt file with shape (N, 16) for further evaluation.

For further usage of these datasets in signal strength profiling and coverage evaluation, refer to the following sections.

Model Evaluation

All scripts related to model evaluation are located in the test/ folder. This stage uses a trained model to process pseudo datasets and compute the classifier outputs.

The following five files are included:

  • global_stats.pth: Stores the global mean and standard deviation of each feature, computed from the training set. Used for feature normalization.
  • model.pth: Contains the trained model parameters and can be directly loaded for inference.
  • derived_quantities.py: Dynamically computes derived features from the input primary features.

Core Programs

  • modely.py: The main evaluation script. It loads the trained model, processes the pseudo datasets (in .pt format), and computes the weighted signal-class probabilities for each event. For each pseudo dataset it produces four histograms:

    • One for the full region
    • Three for the control regions

    All histograms are saved as .pkl files for later use in likelihood fitting.
  • test_model.py: A wrapper script that allows the user to specify:

    • The directory of the input .pt files (pseudo datasets)
    • The directory of the output .pkl files storing the histograms

These scripts are used to evaluate the classifier response on pseudo datasets with randomly sampled nuisance parameters, reflecting realistic variations expected in physical measurements. For instructions on how these histograms are used in likelihood fitting, see the next section.

Signal Strength Fitting

This section describes how to extract the signal strength and nuisance parameter estimates from pseudo datasets using maximum likelihood fitting.

The fitting pipeline consists of three programs, all located in the fitting/ folder:

  • modely.py: The main fitting script. It performs a profile likelihood scan over the signal strength $\mu$ and six nuisance parameters for a given pseudo dataset.

  • test_modely.py: A wrapper script that specifies:

    • The path to the .pkl file containing the signal-class probability histograms (produced during model evaluation)
    • The path to the .pkl file containing the true values of the signal strength and the six nuisance parameters used to generate the pseudo dataset

    After fitting, the estimated signal strength, confidence interval, and nuisance parameter estimates are written back into the same .pkl file that contains the true values.
  • write.py: A utility script that extracts the relevant information (true and fitted values of the signal strength and nuisance parameters) from all .pkl files and writes it to a .txt file. This text file serves as the input for downstream visualization and coverage analysis.

The output of this stage includes:

  • Best-fit values of the signal strength $\mu$
  • 68.27% confidence intervals for $\mu$
  • Estimated values for all six nuisance parameters
  • Aggregated .txt file for further analysis and plotting
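The core of such a fit is a binned Poisson likelihood scan. Below is a heavily simplified sketch with a single parameter $\mu$ and no nuisance profiling; all numbers are made up for illustration:

```python
import numpy as np

def nll(mu, signal, background, observed):
    """Binned Poisson negative log-likelihood for expected = mu*s + b (constant terms dropped)."""
    expected = mu * signal + background
    return np.sum(expected - observed * np.log(expected))

rng = np.random.default_rng(3)
signal = np.full(20, 5.0)       # toy per-bin signal template
background = np.full(20, 50.0)  # toy per-bin background template
observed = rng.poisson(1.0 * signal + background).astype(float)

mu_grid = np.linspace(0.0, 3.0, 301)
scan = np.array([nll(mu, signal, background, observed) for mu in mu_grid])
mu_hat = mu_grid[scan.argmin()]

# 68.27% interval: where 2 * (NLL - NLL_min) <= 1 (no nuisance profiling here).
inside = mu_grid[2 * (scan - scan.min()) <= 1.0]
ci = (inside.min(), inside.max())
```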

This completes the full pipeline from training to likelihood-based statistical inference.

Result Visualization

All scripts for generating plots are contained in the plots folder. These plots summarize the key results of the model evaluation and signal strength fitting. The main visualizations include:

  • Classifier Output Distributions
    Signal-class probabilities for the four processes are shown as both unweighted and weighted histograms. The weighted plot reflects variations across a grid of 11,849 fixed nuisance parameter points. Shaded bands indicate the bin-by-bin envelope formed by the maximum and minimum counts over these variations.

  • Coverage Studies for Signal Strength ($\mu$)
    For multiple values of the true signal strength ($\mu_\text{true}$), 1000 pseudo-experiments are visualized per value. Each plot shows the predicted 68.27% confidence interval, the true value ($\mu_\text{true}$), and the distribution of maximum likelihood estimates ($\hat{\mu}$). Coverage is computed as the fraction of intervals containing $\mu_\text{true}$.

  • Correlation Plots
    Scatter plots show the correlation between $\mu_\text{true}$ and the estimated $\hat{\mu}$, as well as between each nuisance parameter and its corresponding profile estimate across 50,000 toy datasets.

  • Summary of Interval Width and Coverage
    The average width of the confidence interval for $\mu$ and the corresponding coverage are shown as functions of $\mu_\text{true}$, providing a quantitative summary of statistical performance.
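The coverage number itself is a simple fraction, which can be sketched with toy intervals (all values below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true = 1.0

# Toy 68.27% intervals for 1000 pseudo-experiments: (lower, upper).
centers = rng.normal(mu_true, 0.3, size=1000)
lo, hi = centers - 0.3, centers + 0.3

# Coverage = fraction of intervals that contain the true value.
coverage = np.mean((lo <= mu_true) & (mu_true <= hi))
```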

These figures validate the model’s predictive performance and its robustness under nuisance parameter variations.
