diff --git a/sites/docs/app/global.css b/sites/docs/app/global.css
index 4c5a450..56dd68e 100644
--- a/sites/docs/app/global.css
+++ b/sites/docs/app/global.css
@@ -1224,6 +1224,14 @@ figure.shiki code {
font-variant-numeric: tabular-nums;
}
+.docs-body h3.no-counter {
+ counter-increment: none;
+}
+
+.docs-body h3.no-counter::before {
+ display: none;
+}
+
@media (max-width: 760px) {
.site-top {
align-items: flex-start;
diff --git a/sites/docs/content/docs/challenge-reference/index.mdx b/sites/docs/content/docs/challenge-reference/index.mdx
index 52f5b9a..f2d5087 100644
--- a/sites/docs/content/docs/challenge-reference/index.mdx
+++ b/sites/docs/content/docs/challenge-reference/index.mdx
@@ -9,5 +9,6 @@ these shared definitions when needed.
## Pages
+- [Open Problems](/docs/challenge-reference/research-challenges-doc): description of open problems in physics and AI/ML research.
- [Challenge Rules](/docs/challenge-reference/challenge-rules): shared structure, validation, and submission expectations.
- [Metrics](/docs/challenge-reference/metrics): scoring definitions and rationale used by challenge pages.
diff --git a/sites/docs/content/docs/challenge-reference/meta.json b/sites/docs/content/docs/challenge-reference/meta.json
index 2b80574..4d5317c 100644
--- a/sites/docs/content/docs/challenge-reference/meta.json
+++ b/sites/docs/content/docs/challenge-reference/meta.json
@@ -3,6 +3,7 @@
"defaultOpen": true,
"pages": [
"challenge-rules",
- "metrics"
+ "metrics",
+ "research-challenges-doc"
]
}
diff --git a/sites/docs/content/docs/challenge-reference/research-challenges-doc.mdx b/sites/docs/content/docs/challenge-reference/research-challenges-doc.mdx
new file mode 100644
index 0000000..96c9b69
--- /dev/null
+++ b/sites/docs/content/docs/challenge-reference/research-challenges-doc.mdx
@@ -0,0 +1,92 @@
+---
+title: Open Problems and Challenges
+description: Describe the open problems in both physics and AI/ML research, and the related open challenges
+---
+
+# Research Challenges
+
+To better decouple different aspects of research, we categorize the challenges as physics-oriented and AI/ML-oriented.
+
+## Physics Open Problems
+
+The **Monte Carlo (MC) method** is a cornerstone of Experimental High Energy Physics (E-HEP). Instead of solving complex physics equations analytically, researchers generate enormous numbers of simulated "universes," each sampled from a space of possible physics models and measurement uncertainties. If the real universe's behavior falls within that simulated ensemble — and we can quantify *how confidently* it does — we have a principled, statistical way to test our theories against reality.
+
+But MC simulation is not without costs. As experiments grow more ambitious — larger detectors, subtler signals, tighter precision targets — the method's limitations are becoming active bottlenecks.
+
+### What to model?
+
+Every MC simulation begins with a design question: which physics processes are worth implementing, and at what level of detail? The obvious cases are easy — a water Cherenkov detector does not need a liquid Argon scintillation model. But the boundaries blur quickly. Should one implement Raman scattering alongside Rayleigh and Mie, given that it contributes far less to light propagation? A hand-wavy "it'll be smeared by systematics" might be defensible in isolation, but accumulated across dozens of such decisions, small omissions can quietly propagate into the final analysis.
+
+A fully complete model is rarely an option either — tracking every photon from every point in the detector toward every one of thousands of light sensors is combinatorially infeasible. The right approximation must be *accurate enough* without being prohibitively expensive, and there is no universal answer.
+
+In the end, verifying that the MC faithfully reproduces *known* physics — using data deliberately chosen to be independent of the measurement target — becomes a critical task for every experiment. This is easier said than done, and it leads directly to the next open problem.
+
+### Can we ever fully validate?
+
+The short answer is no. Calibration, that is, tuning the MC against real data where the answer is already known, builds confidence that the simulation faithfully represents the detector. But it is never exhaustive. Neither the Hyper-Kamiokande (HK) detector nor the Deep Underground Neutrino Experiment (DUNE) detectors can be probed at every location, energy, and particle type within real experimental constraints.
+
+Worse, some calibration is not achievable by design. Calibration sources cover a finite regime; real signals do not respect those boundaries. Extrapolating MC validity beyond the calibrated region requires assumptions that cannot be directly tested, and assuming smooth, interpolable detector responses is itself a modeling decision that loops back to the first open problem.
+
+The above points have a direct and underappreciated consequence for reconstruction. Algorithms that identify and classify particle interactions are developed and validated entirely on MC. Conventional methods often rely on computational shortcuts with no physical motivation, such as Gaussian likelihoods chosen for tractability rather than accuracy. Such mis-modeling is also hard to absorb into the associated systematic errors: assigning Gaussian errors to a binomially distributed quantity, for example, misrepresents it near its boundaries, where the true distribution is bounded and asymmetric. As a result, every MC simulation carries a set of quietly unverified assumptions into the final analysis, a fact that motivates the third open problem.
+
+### When MC meets reality
+
+Discrepancies between MC and real data are not rare — they are expected. But when they appear after unblinding, they cannot simply be patched: researchers commit to their analysis strategy *before* seeing real data precisely to avoid tailoring the simulation toward a preferred answer.
+
+So what does a gap mean? The possibilities are uncomfortably broad — new physics, an implementation error, a missing model, or a statistical fluctuation — and distinguishing between them is rarely clean. Worse, a convincing explanation for one experiment's discrepancy may be entirely wrong for another. Solutions tend to be ad-hoc, tuned to a specific detector and systematic error budget. A universal understanding of MC-data tension has eluded the field for decades — and any approach that hopes to generalize across experiments must contend with all three layers of uncertainty described above.
+
+
+Open Challenge P-I - Panoptic Segmentation on 3D Points
+
+[View Challenge](/challenges/fm-panoptic-segmentation)
+
+Previous work on supervised ML approaches has shown great potential for both hit-level semantic segmentation and particle instance clustering (e.g., SPINE and WatchMaL). However, such models inherit a subtle problem from MC modeling: a model trained on MC-labeled images learns to reproduce the MC's embedded assumptions, not necessarily the true underlying physics. And since real detector data carries no ground-truth labels, there is no direct way to verify whether the reconstruction is right, only whether it is internally consistent with the MC it was trained on. This is the core difficulty of panoptic segmentation in detectors such as liquid argon time projection chambers (LArTPCs): it is not merely a vision problem, but a label-free generalization problem with no clean validation path.
+
+Open Challenge P-II - Ring Counting and Particle Identification
+
+Water Cherenkov blablabla
+
+Open Challenge P-III - Cross-detector Transfer
+
+[View Challenge](/challenges/fm-event-completion)
+
+The cancellation of flux and cross-section systematic uncertainties between near and far detectors is a classic technique in long-baseline neutrino oscillation experiments. The assumption is neat: physics does not change with the observer's location. However, this assumption carries an implicit requirement: the observer must fully understand the detector responses and be able to infer the underlying true physics, which, as the preceding open problems illustrate, is nearly impossible in practice. The current working approach is to marginalize over detector systematic uncertainties when propagating flux and cross-section constraints, essentially a prior-posterior fit between MC and data. As next-generation experiments like HK and DUNE demand increasingly tight systematic budgets, this approach is becoming an active bottleneck.
+
+A data-driven foundation model offers a principled alternative. Rather than marginalizing over detector differences, such a model would encode detector-specific responses into a shared representation space — allowing the underlying physics to be transferred across detectors without being entangled with their individual systematics.
+
+Open Challenge P-IV - Generator Surrogate Models
+
+
+
+Open Challenge P-V - Detector Surrogate Models
+
+
+
+## AI/ML Open Problems
+
+The rapid advancement of AI/ML — and foundation models in particular — offers promising tools for the physics problems described above. But deploying these methods in a rigorous scientific context introduces its own set of difficulties. We categorize them below, in rough correspondence with the physics problems.
+
+### How big can we go?
+
+Modern foundation models are large by design — their generalization power comes precisely from scale. Neutrino physics detectors are also large by nature, and demand high spatial and timing resolution. A single LArTPC event can span hundreds of thousands of voxels; a water Cherenkov event reconstruction must account for thousands of light sensors responding in temporal patterns. Naively scaling up a vision transformer or graph neural network to handle these inputs quickly exhausts GPU memory and compute budgets that most research groups can realistically afford.
+
+This is not merely an engineering inconvenience. The approximations one makes to fit a model into available hardware — downsampling, aggressive pruning, reduced precision — are themselves modeling decisions with unclear downstream consequences. The cost-benefit tradeoff between model capacity and computational feasibility mirrors the MC modeling problem almost exactly, except now the unknown is not a physics process but a learned representation.
+
+### Anomaly or bug?
+
+MC simulations let physicists implement any physics process, including rare ones. However, generating sufficient MC statistics to represent these rare processes faithfully is itself computationally prohibitive. A model trained on such imbalanced data learns the common cases well and the rare ones poorly, and standard performance metrics like overall accuracy will not expose this, because the rare cases contribute negligibly to the aggregate score.
+
+ML suffers from the same problem. A rare event that falls in a sparsely populated region of the learned representation space may not visibly trigger any failure mode: it is simply embedded poorly, its features averaged into nearby common events, with no indication that the representation has broken down. The model does not know what it does not know. Designing training objectives and evaluation protocols that surface these quiet failures, rather than rewarding average-case performance, is an open and largely unsolved problem in ML, and one that matters acutely in physics, where rare processes might carry the most scientific weight.
+
+### Can physicists trust it?
+
+Physics analyses are not classification competitions. A result submitted to a journal — or used to claim evidence of new physics — must come with rigorously quantified uncertainties, traceable systematic error budgets, and decision logic that can be scrutinized and challenged by the community. A black-box model that achieves impressive benchmark performance but cannot explain *why* it made a decision is, for most physics purposes, not yet useful.
+
+One specific aspect of this problem is the quantification of model uncertainty. It is not enough to attach an uncertainty estimate to a model's output: the very definition of uncertainty shifts away from that of conventional physics analyses, where researchers begin with explicit prior assumptions and derive posteriors through well-established statistical frameworks. **In a data-driven approach, what does a systematic uncertainty even mean?** Nonetheless, it is at least clear that the uncertainty estimates must be *calibrated*, meaning they should faithfully reflect the true probability of being wrong, including in regimes where the model has never been tested. This connects directly back to the validation problem in the physics challenges: an uninterpretable model trained on MC inherits all of MC's unverified assumptions, with no mechanism to expose or quantify them.
+
+Open Challenge M-I - Event Completion
+
+[View Challenge](/challenges/fm-event-completion/)
+
+How do we make sure that a model is learning a representation of detector responses rather than a trivial solution for the event topologies it has seen? This leads to the event completion problem: if a model can predict the unobserved parts of an interaction from the observed ones, its representation is likely to have captured the detector physics.
+
+Event completion maps naturally onto the masked autoencoding framework that has driven recent advances in vision and language foundation models. Given a partial observation of a particle interaction, the model learns to predict the masked region, not as a generative exercise but as a physically constrained inference problem. A model that can do this reliably has, implicitly, learned a representation of the underlying physics that generalizes beyond the specific detector configuration it was trained on. This makes event completion a compelling pretraining objective for a neutrino physics foundation model: it is self-supervised, requires no ground-truth labels from MC, and directly probes whether the learned representation captures real detector physics.
\ No newline at end of file