feat(units): add an expression parser #173

Open
HaoZeke wants to merge 8 commits into metatensor:main from HaoZeke:feat/unit-expression-parser

@HaoZeke HaoZeke commented Mar 4, 2026

Closes #154.

Replaced the per-quantity lookup tables with a Shunting-Yard expression parser
that works on arbitrary compound unit strings in the spirit of lumol.

"kJ/mol/A^2"  -->  tokenize  -->  shunting-yard  -->  AST  -->  eval
                   [kJ,/,mol,     [kJ,mol,/,         tree     {factor, dim}
                    /,A,^,2]       A,2,^,/]

Each token resolves to an SI conversion factor and a 5-element dimension vector
[L, T, M, Q, Theta]. The parser composes these through multiplication, division,
and exponentiation. The conversion factor between two expressions is the ratio
of their SI factors, computed after verifying that their dimensions are equal.
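
As an illustration of the composition rules above, here is a minimal Python sketch (not the C++ code from this PR). The `Unit` class, the constant names, and `conversion_factor` are hypothetical; the numeric values are the usual SI/CODATA values for eV, angstrom, fs, and the atomic mass unit.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical (factor, dim) pair: SI factor and exponents of [L, T, M, Q, Theta]."""
    factor: float
    dim: tuple

    def __mul__(self, other):
        # Multiplication: factors multiply, dimension exponents add.
        return Unit(self.factor * other.factor,
                    tuple(a + b for a, b in zip(self.dim, other.dim)))

    def __truediv__(self, other):
        # Division: factors divide, dimension exponents subtract.
        return Unit(self.factor / other.factor,
                    tuple(a - b for a, b in zip(self.dim, other.dim)))

    def __pow__(self, power):
        # Exponentiation: factor is raised to the power, exponents are scaled.
        return Unit(self.factor ** power, tuple(a * power for a in self.dim))


# Illustrative tokens only; the real table in the PR is much larger.
EV = Unit(1.602176634e-19, (2, -2, 1, 0, 0))   # energy = L^2 T^-2 M
ANGSTROM = Unit(1e-10, (1, 0, 0, 0, 0))
FS = Unit(1e-15, (0, 1, 0, 0, 0))
U = Unit(1.66053906660e-27, (0, 0, 1, 0, 0))


def conversion_factor(src, dst):
    """Ratio of SI factors, after checking that the dimensions agree."""
    if any(not math.isclose(a, b, abs_tol=1e-12) for a, b in zip(src.dim, dst.dim)):
        raise ValueError(f"dimension mismatch: {src.dim} vs {dst.dim}")
    return src.factor / dst.factor


# (eV*u)^(1/2) and u*A/fs both have dimension [1, -1, 1, 0, 0].
factor = conversion_factor((EV * U) ** 0.5, U * ANGSTROM / FS)
```

Comparing dimensions with a tolerance rather than `==` is a choice made in this sketch so that fractional powers such as `^(1/2)` do not fail on floating-point round-off.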

API changes

| Before (3-arg)                                    | After (2-arg)                                      |
|---------------------------------------------------+----------------------------------------------------|
| ~unit_conversion_factor("energy", "eV", "meV")~   | ~unit_conversion_factor("eV", "meV")~              |
| ~unit_conversion_factor("force", "eV/A", "eV/A")~ | ~unit_conversion_factor("eV/A", "eV/A")~           |
| Not possible                                      | ~unit_conversion_factor("(eV*u)^(1/2)", "u*A/fs")~ |

Expression syntax

Operators: * (multiply), / (divide), ^ (power), () (grouping).
Whitespace ignored. Case-insensitive. Numeric literals allowed in exponents.
Fractional exponents via parenthesized division: ^(1/2).
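
To make the syntax rules concrete, the following Python sketch shows one way the tokenize and shunting-yard stages could work. It is a sketch under assumptions, not the PR's C++ implementation: `tokenize`, `to_rpn`, and the regular expression are hypothetical, `^` is assumed to be right-associative, and case folding is assumed to happen later, at token lookup.

```python
import re

# One token: a unit name, a numeric literal, or an operator/parenthesis.
TOKEN_RE = re.compile(r"\s*([A-Za-z_µ]+|\d+(?:\.\d+)?|[*/^()])")
PRECEDENCE = {"*": 1, "/": 1, "^": 2}
RIGHT_ASSOC = {"^"}


def tokenize(expr):
    """Split a unit expression into tokens, skipping whitespace."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = TOKEN_RE.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected character {expr[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def to_rpn(tokens):
    """Shunting-yard: convert infix tokens to reverse Polish notation."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators that bind tighter, or equally tight when tok is left-associative.
            while stack and stack[-1] in PRECEDENCE and (
                PRECEDENCE[stack[-1]] > PRECEDENCE[tok]
                or (PRECEDENCE[stack[-1]] == PRECEDENCE[tok] and tok not in RIGHT_ASSOC)
            ):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
        else:
            output.append(tok)  # unit name or number
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output


rpn = to_rpn(tokenize("kJ/mol/A^2"))
# rpn == ["kJ", "mol", "/", "A", "2", "^", "/"], matching the diagram above
```

With the same functions, `(eV*u)^(1/2)` becomes `eV u * 1 2 / ^`, which is how a parenthesized division ends up as a fractional exponent.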

Token table

Single flat unordered_map with 30+ entries covering length (angstrom, bohr, nm,
m, cm, mm, um), energy (eV, meV, hartree, ry, joule, kcal, kJ), time (fs, ps),
mass (u, kg, g, electronmass), charge (e, coulomb), dimensionless (mol), and
derived (hbar).

Notes

kelvin is NOT in the token table because temperature conversions between
offset-based scales (Celsius, Fahrenheit) are non-multiplicative.
DIM_TEMPERATURE exists as dimension [0,0,0,0,1] for potential future use but
no tokens currently carry it. (This could be revisited at the next API break, possibly during mini-metatomic.)
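
A short Python illustration of why offset-based scales do not fit a factor-only parser: no single multiplicative constant maps Celsius to kelvin, although temperature differences do convert multiplicatively. The function name is hypothetical.

```python
def celsius_to_kelvin(celsius):
    # Affine conversion: an offset, not a scale factor.
    return celsius + 273.15


# If the conversion were multiplicative, K / C would be the same for every C.
ratios = [celsius_to_kelvin(c) / c for c in (10.0, 100.0)]

# Differences are unaffected by the offset, so they convert with factor 1.
delta_kelvin = celsius_to_kelvin(100.0) - celsius_to_kelvin(10.0)
```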

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

Reviewer checklist

  • CHANGELOG updated with public API or any other important changes?

Port lumol's Rust expression parser to C++, enabling compound unit
expressions like kJ/mol/A^2 and (eV*u)^(1/2) with automatic
dimensional validation. Uses SI as internal reference frame to
handle non-coherent base units correctly.

Adds 2-arg unit_conversion_factor(from, to) alongside the existing
3-arg form (now deprecated). Updates internal C++ callers in
model.cpp and system.cpp to the new API.

Register unit_conversion_factor_v2 TorchScript op and add Python
dispatcher that routes 2-arg calls to the new parser and 3-arg
calls (with deprecation warning) to the legacy wrapper. Update
TorchScript callers in model.py to use v2 directly.

C++ tests (15 cases, 29 assertions): simple conversions, compound
expressions, fractional powers, case insensitivity, dimension
mismatch errors, unknown tokens, backward compat with 3-arg API.

Python tests (12 functions): 2-arg API, 3-arg deprecation, ASE
cross-validation, compound expressions, error handling, empty
string identity, valid unit validation.

Replace per-quantity unit tables in misc.rst with a flat token table
grouped by SI dimension, compound expression examples, and the new
2-arg API. Add changelog entries for the parser and deprecation.
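
The 2-arg/3-arg routing described in the commit messages above could look roughly like the Python sketch below. This is not the dispatcher from the PR: `make_dispatcher`, `new_impl`, and `legacy_impl` are hypothetical stand-ins for the parser-backed op and the legacy wrapper, and the warning text is made up.

```python
import warnings


def make_dispatcher(new_impl, legacy_impl):
    """Build a unit_conversion_factor that accepts both call signatures."""

    def unit_conversion_factor(*args):
        if len(args) == 2:
            # New form: (from_unit, to_unit), handled by the expression parser.
            return new_impl(*args)
        if len(args) == 3:
            # Legacy form: (quantity, from_unit, to_unit), kept for compatibility.
            warnings.warn(
                "the 3-argument form is deprecated; pass (from_unit, to_unit)",
                DeprecationWarning,
                stacklevel=2,
            )
            return legacy_impl(*args)
        raise TypeError(f"expected 2 or 3 arguments, got {len(args)}")

    return unit_conversion_factor


# Stub backends, only to show which path each call takes.
dispatch = make_dispatcher(
    new_impl=lambda from_unit, to_unit: ("new", from_unit, to_unit),
    legacy_impl=lambda quantity, from_unit, to_unit: ("legacy", from_unit, to_unit),
)
```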
@HaoZeke HaoZeke requested review from GardevoirX and Luthaf March 4, 2026 11:10
HaoZeke added 2 commits March 4, 2026 11:23
- models.cpp: error message changed from "unknown unit 'X' for Y" to
  "unknown unit token 'X'" after replacing per-quantity lookup with
  expression parser
- models.cpp: use valid unit ("eV") in JSON serialization test instead
  of "something" which the parser rejects
- model.cpp: sort quantity names in warning for deterministic output
  (unordered_map iteration order is not guaranteed)
- cxx/misc.rst: disambiguate doxygen reference for overloaded
  unit_conversion_factor (2-arg and 3-arg)
@HaoZeke HaoZeke force-pushed the feat/unit-expression-parser branch 2 times, most recently from 979302f to 92bf24c Compare March 4, 2026 11:59
@HaoZeke HaoZeke force-pushed the feat/unit-expression-parser branch from 92bf24c to 0fe0ef7 Compare March 4, 2026 12:05
@GardevoirX

Thanks a lot! I think it would be better if we could use the new functionality to check that the quantity and unit match when initializing ModelOutput here:
https://github.com/HaoZeke/metatomic/blob/0fe0ef73e82b089811278472acba56267578b80c/metatomic-torch/include/metatomic/torch/model.hpp#L48-L61

Address PR metatensor#173 review feedback from GardevoirX:
- Add s, second, ms, us, ns, ps with full-word aliases to time tokens
- Add tests verifying ModelOutput rejects mismatched quantity/unit dims
- Add tests for standalone micro sign (U+00B5) -> Dalton resolution
- Update docs token table and doxygen with new time unit coverage
- Fix stray dash in RST list-table Dimensionless row
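
The quantity/unit check requested in the review can be pictured with the Python sketch below. It is only an illustration: `QUANTITY_DIMS`, `UNIT_DIMS`, and `check_quantity_unit` are hypothetical, and the hard-coded `UNIT_DIMS` table stands in for the dimension vector that the expression parser would compute for a unit string.

```python
# Expected [L, T, M, Q, Theta] exponents per quantity (illustrative subset).
QUANTITY_DIMS = {
    "length": (1, 0, 0, 0, 0),
    "energy": (2, -2, 1, 0, 0),
    "force": (1, -2, 1, 0, 0),
}

# Stand-in for parser output: unit expression -> dimension vector.
UNIT_DIMS = {
    "angstrom": (1, 0, 0, 0, 0),
    "eV": (2, -2, 1, 0, 0),
    "eV/A": (1, -2, 1, 0, 0),
}


def check_quantity_unit(quantity, unit, unit_dims=UNIT_DIMS):
    """Raise ValueError if the unit's dimensions do not match the quantity."""
    expected = QUANTITY_DIMS[quantity]
    actual = unit_dims[unit]
    if actual != expected:
        raise ValueError(
            f"unit '{unit}' has dimensions {actual}, "
            f"but quantity '{quantity}' expects {expected}"
        )


check_quantity_unit("force", "eV/A")  # passes silently
```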
@HaoZeke HaoZeke requested a review from GardevoirX March 4, 2026 15:26

@GardevoirX GardevoirX left a comment

LGTM, love it!

Development

Successfully merging this pull request may close these issues.

Use a proper expression parser for unit conversions

3 participants