Development updates 20260515#24

Merged
jduprat merged 27 commits into facebookresearch:main from jduprat:dev
May 15, 2026

@jduprat jduprat commented May 15, 2026

[New] Layout becomes purely affine; ComposedLayout carries every swizzled / non-affine form.
[New] Promote split_outer_swizzle() to public API
[New] Revisited Exception hierarchy
[New] _address_bounds fast path for the canonical Sw o L form
[Fix] Tensor over a Layout-with-embedded-swizzle now adds the external offset AFTER the swizzle
[Fix] cosize() on embedded-swizzle Layout for non-power-of-2 shapes
[Fix] Tensor[(slice(None), 0), 1] now raises TypeError
[Fix] Drop typing.Self fallback that imported the undeclared typing_extensions
[Perf] Cache cosize() on each ComposedLayout and swizzled Layout instance
[Perf] _address_bounds: drop the embedded-vs-composed gate on the fast path
[Robustness] Make ComposedLayout's offset keyword-only
[Robustness] as_affine_layout() does an explicit is_affine() post-check
[Robustness] Validate warp_size > 0 in coalescing_efficiency / segment_analysis
[Robustness] viz raises an actionable ImportError when matplotlib/numpy are missing
[API] Align four pre-existing exception-class inconsistencies
[Refactor] Split layouts.py (4.4k LOC) into a layouts/ package with three layered modules. No public API change
[Refactor] Dedup bank_conflicts / per_group_bank_conflicts via shared helper
[Refactor] Dedup coalescing_efficiency / per_group_coalescing via shared helper
[Refactor] Move Layout._calculate_max_offset to module-level
[Refactor] Rename internal _affine_inner → _strip_swizzle
[Refactor] Hoist from math import gcd to module top in analysis.py
[Tests] Parametrize 32 AMD oracle C-layout per-atom tests into a single parametrized test
[Tests] Relax isinstance(R, Layout); R.swizzle == sw assertions to representation-tolerant structural / pointwise checks
[Tests] Add examples/composed.py to make examples
[Docs] Update layout_api.md / tensor_api.md / analysis_api.md / examples after Layout becomes purely affine

jduprat and others added 27 commits May 12, 2026 12:36
The staticmethod was always called as obj._calculate_max_offset(obj.shape,
obj.stride) -- it did not use 'self' at all and was already private. As a
free function it is easier to reuse from cosize() and from any future
helper that needs the affine span without going through a Layout instance.

Tests: full suite (881 passed, 2 skipped).
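
A minimal sketch of what such a free function can look like, assuming a flat shape/stride pair with non-negative strides (the real helper presumably also handles nested shapes). The names `affine_max_offset` and `affine_cosize` are illustrative, not the library's:

```python
def affine_max_offset(shape, stride):
    """Largest offset reached by a flat layout with non-negative strides.

    Hypothetical stand-in for the module-level helper: it takes plain
    tuples and needs no Layout instance.
    """
    return sum((extent - 1) * step for extent, step in zip(shape, stride))


def affine_cosize(shape, stride):
    # For a purely affine layout the span is the max offset plus one.
    return affine_max_offset(shape, stride) + 1


print(affine_max_offset((4, 8), (8, 1)))  # 31
print(affine_cosize((4, 8), (1, 4)))      # 32
```

Because the function only needs the two tuples, a caller such as cosize() can reuse it without constructing or holding a Layout.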
The local import inside order() saved nothing -- math.gcd is in the
standard library and importing at module top is the convention.

Tests: tests/analysis.py (173 passed).
Replace the docstring-only alias for as_layout() with an explicit affinity
post-check. The is_affine() assertion is belt-and-suspenders today (as_layout
already rejects ComposedLayout) but documents the contract at one point of
truth and protects against future loosening of as_layout(). Error message
points callers at as_layout_expr() for the non-affine path, and the docs
table now reflects the same.

Tests: full suite (881 passed, 2 skipped).
Internal helper that returns the underlying affine Layout with any embedded
Swizzle removed. The new name says what the helper does at the call site:
'strip the swizzle' is exactly the operation, used by flatten,
right_inverse / left_inverse, compose against the affine layer, and the
domain-only transforms. Also expand the docstring to explain why the
operation is recurring.

Mechanical rename across 7 call sites within layouts.py; no external
callers. No behavior change.

Tests: full suite (881 passed, 2 skipped).
The composed.py example was the only one in examples/ not exercised by
the smoke target. Adding it ensures the ComposedLayout demo stays in sync
with the API (same coverage gate as layouts.py / tensor.py / viz.py).

Tests: make examples (4 scripts, 167 SVG + 2 PNG + 2 PDF generated).
The Self import existed only because typing.Self is Python >=3.11 while
requires-python = '>=3.10'. The typing_extensions fallback was unsound:
typing_extensions was never declared in the dependencies, so a fresh 3.10
install would ImportError on import.

Single use site (Layout.squeeze) is a return-type annotation; with
'from __future__ import annotations' already in effect, annotations are
strings at runtime, so 'Layout' as a forward-ref string is equivalent for
the runtime and accepted by static type checkers (mypy/pyright resolve it
in scope).

Net effect: drops an undeclared optional dependency, removes a try/except
import block, no support-version change, no behavior change.

Tests: full suite (881 passed, 2 skipped).
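
To illustrate the forward-reference replacement described above, here is a toy class (not the real Layout) whose `squeeze` is annotated with the string `"Layout"` instead of `typing.Self`. The body of `squeeze` is an assumption made only so the example runs:

```python
class Layout:
    """Toy stand-in for the real class; only the annotation style matters."""

    def __init__(self, shape, stride):
        self.shape = tuple(shape)
        self.stride = tuple(stride)

    def squeeze(self) -> "Layout":
        # Drop size-1 modes; keep one trivial mode if nothing is left.
        kept = [(s, d) for s, d in zip(self.shape, self.stride) if s != 1]
        if not kept:
            kept = [(1, 0)]
        shape, stride = zip(*kept)
        return Layout(shape, stride)


# The annotation is stored as a plain string, so no typing import is needed.
print(Layout.squeeze.__annotations__["return"])  # Layout
```

With `from __future__ import annotations` in effect, an unquoted `Layout` annotation is stored the same way, which is why the change needs neither `typing.Self` nor `typing_extensions`.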
Both functions executed the same per-thread-range bank-mapping kernel.
Extract _bank_conflicts_for_thread_range() and have both call it; the
single-group version passes start=0/end=min(thread_count,group_size),
the per-group version calls it once per group.

Single source of truth for the bank-conflict math; future bug-fixes or
behavior tweaks (e.g. handling 8-byte words) land in one place. Public
API and per-range result dict shape (conflict_free, max_ways,
bank_to_threads) are unchanged.

Tests: full suite (881 passed, 2 skipped).
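
The dedup pattern can be sketched as follows. This is a simplified model, assuming threads are given as a flat list of word offsets and that the bank is `offset % num_banks`; the helper's signature and the `num_banks` parameter are assumptions, and broadcast handling is ignored. Only the helper name and the result keys come from the commit message:

```python
from collections import defaultdict


def _bank_conflicts_for_thread_range(offsets, start, end, num_banks=32):
    """Map threads [start, end) to banks and report the worst collision."""
    bank_to_threads = defaultdict(list)
    for thread in range(start, end):
        bank_to_threads[offsets[thread] % num_banks].append(thread)
    max_ways = max((len(t) for t in bank_to_threads.values()), default=0)
    return {
        "conflict_free": max_ways <= 1,
        "max_ways": max_ways,
        "bank_to_threads": dict(bank_to_threads),
    }


def bank_conflicts(offsets, group_size=32):
    if group_size <= 0:
        raise ValueError("group_size must be > 0")
    # Single-group version: only the first group is analysed.
    end = min(len(offsets), group_size)
    return _bank_conflicts_for_thread_range(offsets, 0, end)


def per_group_bank_conflicts(offsets, group_size=32):
    if group_size <= 0:
        raise ValueError("group_size must be > 0")
    # Per-group version: one helper call per group of threads.
    return [
        _bank_conflicts_for_thread_range(
            offsets, g, min(g + group_size, len(offsets))
        )
        for g in range(0, len(offsets), group_size)
    ]
```

Both public functions are thin wrappers, so a change to the bank mapping lands in the helper alone.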
Both functions ran the same per-thread-range cache-line / efficiency
computation on top of _group_access_offsets(). Extract
_coalescing_for_thread_range() and have both call it.

Single source of truth for the cache-line/efficiency math; future tweaks
(e.g. 32-byte segment counting, alignment-aware accounting) land in one
place. Public API and per-range result dict shape (transactions,
efficiency, cache_lines) are unchanged.

Tests: full suite (881 passed, 2 skipped).
Both functions take a warp_size kwarg and use min(thread_count, warp_size)
without validating warp_size. Negative or zero values silently produced
nonsense results (negative range or empty range, with 0/0 efficiency etc.).

Now matches the validation pattern already used by bank_conflicts and the
per_group_* helpers (group_size > 0).

Tests: tests/analysis.py (173 passed); manually exercised with
warp_size in {0, -4} to confirm ValueError.
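
A minimal sketch of the guard, assuming it runs before `min(thread_count, warp_size)` is evaluated. The function names and the message text are hypothetical:

```python
def _check_positive(name, value):
    """Reject zero or negative sizes before they reach range arithmetic."""
    if value <= 0:
        raise ValueError(f"{name} must be > 0, got {value}")


def active_threads(thread_count, warp_size=32):
    # Hypothetical helper: the thread range the analysis would look at.
    _check_positive("warp_size", warp_size)
    return range(min(thread_count, warp_size))


print(len(active_threads(64)))      # 32
print(len(active_threads(8, 16)))   # 8
```

Without the check, `warp_size=0` would yield an empty range and a 0/0 efficiency further down.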
The two key-walking helpers (_has_nested_none, _contains_free_coordinates)
were near-duplicates. _has_nested_none silently misclassified one case:
slice objects nested in a hierarchical coordinate tuple (e.g.
`T[(slice(None), 0), 1]`) returned False, so the code treated the key as
fully fixed and passed the slice through to slice_and_offset, which is
not contractually defined for slice objects.

Replace with _contains_free_coordinates at the call site and add a focused
_tuple_contains_slice helper that surfaces the misuse with an explicit
TypeError. Net helper count is unchanged; the diverging case is now an
explicit error rather than an undefined silent-pass.

Tests: full suite (882 passed, 2 skipped); new regression
test_tensor_slice_rejects_slice_inside_coordinate_tuple covers the
explicit reject.
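
The explicit reject can be sketched as below, assuming top-level slices such as `T[:, 0]` remain valid and only slices nested inside a mode's coordinate tuple are errors. `reject_nested_slices` is a hypothetical wrapper; only `_tuple_contains_slice` is named in the commit, and its real body may differ:

```python
def _tuple_contains_slice(key):
    """True if a slice object appears anywhere inside a (nested) tuple."""
    if isinstance(key, slice):
        return True
    if isinstance(key, tuple):
        return any(_tuple_contains_slice(item) for item in key)
    return False


def reject_nested_slices(key):
    """Allow top-level slices; reject slices inside a hierarchical mode."""
    modes = key if isinstance(key, tuple) else (key,)
    for mode in modes:
        if isinstance(mode, tuple) and _tuple_contains_slice(mode):
            raise TypeError(
                "slice objects are not allowed inside a hierarchical "
                f"coordinate tuple: {key!r}"
            )
    return key


reject_nested_slices((slice(None), 0))          # accepted
# reject_nested_slices(((slice(None), 0), 1))   # raises TypeError
```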
cosize(ComposedLayout) is O(size(L)) -- it must enumerate the full
domain because the outer can be a Swizzle, a non-bijective Layout, or
another ComposedLayout that permutes/rescales the inner's image. The
result is invariant for a given (outer, inner, offset) triple, so cache
it once on the instance.

Implementation uses a declarative dataclass field with init/repr/eq/hash
all set to False, which keeps the cache slot out of __init__,
__repr__, __eq__ and __hash__ (so two structurally equal layouts with
different cache states still compare equal and remain dict-key
compatible). Frozen dataclass blocks normal __setattr__, so the lazy
write in cosize() goes through object.__setattr__.

Tests: full suite (884 passed, 2 skipped). New tests verify the cache
is read on the hot path (poison + observe) and that the cache is
invisible to equality/hashing.
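
The caching technique can be shown on a toy frozen dataclass. `CachedImage` is a hypothetical stand-in, not the real ComposedLayout; note that the dataclass field option controlling `__eq__` is spelled `compare`:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CachedImage:
    """Toy frozen layout: `image` lists every offset the layout produces."""

    image: tuple
    # Cache slot excluded from __init__, __repr__, __eq__ and __hash__.
    _cached_cosize: Optional[int] = field(
        default=None, init=False, repr=False, compare=False, hash=False
    )

    def cosize(self):
        if self._cached_cosize is None:
            # Frozen dataclasses block normal assignment, so bypass it.
            object.__setattr__(self, "_cached_cosize", max(self.image) + 1)
        return self._cached_cosize


a = CachedImage((0, 1, 2, 3, 5))
b = CachedImage((0, 1, 2, 3, 5))
print(a.cosize())                # 6
print(a == b, hash(a) == hash(b))  # True True
```

Here `a` has a populated cache and `b` does not, yet they compare equal and hash identically, which is the property the new tests check.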
The affine path in cosize() returned _affine_max_offset(shape, stride) + 1
even when the layout carried an embedded swizzle, ignoring the swizzle's
effect on the image. For non-power-of-2 affine images the swizzle's XOR
can produce an offset ABOVE the affine max:

    L = Layout(5, 1, swizzle=Swizzle(2, 0, 2))
    affine image: [0, 5)  -> declared cosize 5
    actual image: {0,1,2,3,5}  -> true cosize 6

For power-of-2 affine images the swizzle is a bijection on [0, 2^N) and
the two formulas agree, which is why the bug was invisible to the
existing test suite.

Mirrors the ComposedLayout fix in 5fbd19f for the embedded-swizzle form.
Add a swizzle-enumeration branch (same shape as the ComposedLayout one
but uncached for now -- a Layout cache slot will be added in a follow-up
to amortize the O(size) cost).

Tests: full suite (884 passed, 2 skipped); new regression
test_cosize_swizzled_layout_enumerates_image covers both the bug case
(L(5,1)+Sw -> 6) and the agreement case (L(16,1)+Sw -> 16).
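
The bug case and the agreement case can be reproduced standalone. This sketch assumes `Swizzle(bits, base, shift)` follows CuTe's `Swizzle<B, M, S>` convention with a positive shift, which is consistent with the image quoted above; the function names are illustrative:

```python
def swizzle(bits, base, shift, x):
    """XOR swizzle, assuming CuTe's Swizzle<B, M, S> convention (shift > 0)."""
    mask = ((1 << bits) - 1) << (base + shift)
    return x ^ ((x & mask) >> shift)


def swizzled_cosize(size, bits=2, base=0, shift=2):
    """cosize of Sw o Layout(size, 1), found by enumerating the image."""
    return max(swizzle(bits, base, shift, i) for i in range(size)) + 1


image = sorted(swizzle(2, 0, 2, i) for i in range(5))
print(image)                # [0, 1, 2, 3, 5]
print(swizzled_cosize(5))   # 6
print(swizzled_cosize(16))  # 16
```

Offset 4 maps to 5 because bit 2 is XORed into bit 0, so the image escapes `[0, 5)`; over `[0, 16)` the same swizzle only permutes the range.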
Layout(shape, stride, swizzle=Sw) had its swizzle-aware cosize fixed in
79fa734 by enumerating max(L(i)) + 1, but each call paid O(size). Add a
_cached_cosize slot on Layout (mutable; the class is hand-written, not a
frozen dataclass) and have the swizzled branch read/write it the same
way the ComposedLayout cache works. Unswizzled layouts go through the
O(1) affine path and never touch the cache slot.

Coverage expansion:
  - test_cosize_swizzled_layout_caches_on_instance: poison + observe to
    verify the hot path reads the cache.
  - test_cosize_unswizzled_layout_does_not_populate_cache: unswizzled
    layouts skip the cache entirely (sanity check).
  - test_cosize_swizzled_layout_matches_composed_form: cross-form parity
    check across 4 shape/swizzle combinations -- embedded Layout and
    ComposedLayout(Sw, L, 0) must agree on cosize.
  - test_complement_consumes_corrected_swizzled_cosize: confirms the
    fix propagates downstream. complement uses cosize as its codomain
    bound; with the corrected cosize on Layout(5, 1, swizzle=Sw),
    complement returns a layout of size 2 / stride 5 instead of the
    degenerate result the buggy cosize=5 would have produced.

Tests: full suite (889 passed, 2 skipped).
Skip the O(size) per-coordinate walk in _validate_storage when the layout
is the canonical zero-offset Sw o L form (both ComposedLayout(Sw, L, 0)
and embedded-swizzle Layout). The image lives in [0, cosize(layout))
because Swizzle is a bit-permutation; with cosize cached on the instance
(65fd5c4 / de0269f) repeat calls are O(1).

Two preconditions:

1. Inner affine layout has non-negative strides (image starts at 0, so
   Sw(0) = 0 anchors the lower bound).

2. Tensor offset is added AFTER the swizzle. ComposedLayout always adds
   it after; embedded-swizzle Layout adds it BEFORE the swizzle (per
   tensor.py:38-41), which rotates the swizzle's input domain rather
   than translating the image. So the embedded form only takes the
   fast path at offset == 0; ComposedLayout takes it for any offset.

Five focused tests:

  - matches_walk: table-driven correctness check across both
    representations, both fast and slow paths, agreeing with explicit
    per-coordinate enumeration.
  - fast_path_taken_for_canonical_swizzle: whitebox poison check on
    both preconditioned forms.
  - slow_walk_for_negative_stride_inner: confirms the existing affine
    fast path still handles negative strides.
  - slow_walk_for_embedded_swizzle_with_nonzero_offset: pins the
    precondition I initially got wrong; embedded@offset>0 must walk.
  - slow_walk_for_inverse_form_composed: F6 inverse-form
    (Layout outer, Swizzle inner) walks correctly.

Tests: full suite (894 passed, 2 skipped).
matplotlib (and the numpy it pulls in) is an optional dependency of
the package, exposed via the [viz] extra. Direct submodule imports
like 'from tensor_layouts.viz import draw_layout' (the form the README
documents) used to surface a deep ModuleNotFoundError pointing at
matplotlib internals when the extra wasn't installed -- not actionable.

Wrap the matplotlib/numpy imports at module top in a single try/except
that re-raises ImportError with the install hint (chained from the
original error). Behavior with the extra installed is unchanged.

Tests: full suite (895 passed, 2 skipped). New regression
test_viz_module_raises_actionable_importerror_when_matplotlib_missing
simulates the missing-dependency case via a meta_path finder that
blocks matplotlib + numpy and asserts the wrapped message contains the
'pip install tensor-layouts[viz]' hint.
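
The re-raise pattern can be sketched as a small function. The real change wraps the module-top imports in one try/except; `require_viz_dependency` is a hypothetical helper used here only so the behavior can be exercised without uninstalling matplotlib, and the message wording is an assumption apart from the install hint:

```python
import importlib


def require_viz_dependency(name):
    """Import an optional dependency or fail with an install hint."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Chain from the original error so the root cause stays visible.
        raise ImportError(
            f"{name} is required for tensor_layouts.viz; install it with "
            "'pip install tensor-layouts[viz]'"
        ) from exc


math_module = require_viz_dependency("math")  # present: returned unchanged
```

A missing module surfaces as one ImportError carrying the hint, with the original ModuleNotFoundError available on `__cause__`.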
Same kind of failure was reported with different exception classes in
different places. Fix four small misalignments before introducing the
LayoutError / UnsupportedComposedLayoutError / TensorStorageError
hierarchy so the migration lands cleanly:

  1. to_F2_matrix's F6 inverse-form rejection: was ValueError, now
     NotImplementedError -- matches the canonical
     _reject_swizzle_inner_composed family in layouts.py.

  2. slice_modes / dice_modes 'tuple coord vs scalar shape' rejection:
     was TypeError, now ValueError -- matches the structure-mismatch
     precedent already used by crd2flat / crd2offset and other rank
     mismatches.

  3. prefix_product / suffix_product 'tuple init applied to scalar'
     rejection: was TypeError, now ValueError -- matches the Length
     mismatch sibling raise in the same function.

  4. _validate_order_permutation: 'order argument not iterable' was
     ValueError (wrapping the underlying TypeError), now TypeError --
     the precondition is a type problem, not a value one. The 'not a
     permutation' case at the next branch correctly stays ValueError.

Updated tests/analysis.py::test_to_F2_matrix_rejects_inverse_form_composed_layout
to expect NotImplementedError.

Tests: full suite (895 passed, 2 skipped).
…eError

Same family of failure was being reported with different exception
classes across the codebase (ValueError vs TypeError vs
NotImplementedError) and there was no way for a caller to catch
'layout-algebra error' or 'tensor-storage error' specifically without
matching message text.

Define three small marker classes at the top of layouts.py and migrate
the corresponding raise sites:

  - LayoutError(ValueError) for layout-algebra preconditions
    (shape/stride congruence, rank mismatch, mode out of range, tiler
    incompatibility, swizzle mask overlap, etc.). 42 raise sites in
    the layout algebra now use this.

  - UnsupportedComposedLayoutError(NotImplementedError) for the F6
    inverse-form / Swizzle-in-inner-slot ComposedLayout rejections
    (complement, coalesce, logical_product / logical_divide, slice
    decomposition, to_F2_matrix). Three raise sites canonicalised on
    this name. The unrelated 'this decomposition is not implemented'
    raise in from_F2_matrix stays as plain NotImplementedError.

  - TensorStorageError(ValueError) for Tensor storage-state errors
    (no backing storage on view/assign, layout addresses negative
    indices, storage too small). Four raise sites; the two 'no storage'
    cases were previously TypeError -- the type is right, the *state*
    is wrong, so the new class is more accurate.

All three subclass standard Python base classes, so existing
'except ValueError' / 'except NotImplementedError' handlers continue
to catch them. The new names are added to the package's __all__ so
they propagate via the existing star-import surface; user code can
'from tensor_layouts import LayoutError' and catch the specific kind.

Also updates two tests in tests/tensor.py that asserted on the
previous TypeError class for the 'no storage' cases.

Tests: full suite (895 passed, 2 skipped). All previously-passing
tests still pass; the two tests asserting on the prior TypeError class
were updated to expect TensorStorageError.
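
A sketch of the three marker classes and of the backward-compatibility property described above. The class names and base classes are taken from the commit message; the docstrings and the `checked_mode` function are illustrative only:

```python
class LayoutError(ValueError):
    """Layout-algebra precondition violated."""


class UnsupportedComposedLayoutError(NotImplementedError):
    """Operation not supported for this ComposedLayout form."""


class TensorStorageError(ValueError):
    """Tensor storage is missing, too small, or addressed out of range."""


def checked_mode(rank, mode):
    # Hypothetical raise site: mode index out of range.
    if not 0 <= mode < rank:
        raise LayoutError(f"mode {mode} out of range for rank {rank}")
    return mode


try:
    checked_mode(2, 5)
except ValueError as err:  # pre-existing handlers still catch it
    print(type(err).__name__)  # LayoutError
```

New code can catch `LayoutError` specifically, while older `except ValueError` handlers keep working unchanged.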
The oracle_amd.py file had 32 hand-written test functions, one per atom,
each calling validate_c_layout(<ATOM>, <arch>) with no other variation.
Adding a new atom required appending a new function with the same shape,
and the section banners doubled the boilerplate.

Replace the 32 functions with a single parametrized test driven by an
ORACLE_C_LAYOUT_CASES list of (atom, arch) pairs. The id= callback
reuses the atom name so per-case test IDs stay readable
(test_oracle_validate_c_layout[CDNA_32x32x8_F32F16F16_MFMA-cdna1] etc.).

Coverage parity verified: pytest --collect-only reports the same 32
parametrized cases, one per (atom, arch) pair as before. No change to
the arch tagging, no change to validate_c_layout. The structural
self-consistency tests below this section (TestMFMAStructural and
TestLayoutAlgebra, already parametrized over ALL_ATOMS) are unchanged.

Adding a new atom is now a one-line append to ORACLE_C_LAYOUT_CASES
instead of a copy-paste of the @requires_calculator + def + body.

Tests: full suite (1955 passed, 34 skipped). 32 of the skips come from
@requires_calculator -- this env doesn't have the AMD calculator
package installed; the gating semantics are unchanged.
…TER swizzle)

Previously a Tensor over a Layout-with-embedded-swizzle computed the
storage address as 'Sw(offset + L(coord))' -- the external offset was
folded into the swizzle's input domain. A Tensor over a ComposedLayout
already computed 'offset + ComposedLayout(coord)', adding the offset
linearly AFTER the layout call. The two forms thus disagreed on
addresses for nonzero Tensor offset, even though they denoted the same
function as layout expressions.

The asymmetry was inherited from the very first commit (6cde897, the
import from the Meta-internal predecessor). It pre-dated ComposedLayout
support entirely; ComposedLayout was added later (da0ea0e) with
deliberately CuTe-aligned semantics, and the embedded-swizzle path was
not re-examined against it.

This commit aligns the embedded-swizzle path with CuTe and with the
existing ComposedLayout path. After:

    tensor(coord) == tensor.offset + tensor.layout(coord)

for ALL Tensor forms. The Tensor's external offset is a pointer-style
shift; it never enters the swizzle's input domain.

Cross-referenced against:

  - CuTe C++ ComposedLayout::operator()
    cutlass/include/cute/layout_composed.hpp:114-120
        return layout_a()(offset() + layout_b()(coord));
        // (A o O o B)(c) = A(O + B(c))

  - CuTe C++ Tensor::operator[]
    cutlass/include/cute/tensor_impl.hpp:222-225
        return data()[layout()(coord)];
    The base offset lives in the data() iterator; the layout call is
    NOT given access to it. Slicing folds the slice contribution into
    the engine pointer:
        return make_tensor(data() + offset, sliced_layout);

  - CuTe canonical Tensor documentation
    cutlass/media/docs/cpp/cute/03_tensor.md:9
    'uses the result of the Layout computation to offset and dereference
    a random-access iterator held by the Engine.'

  - pycute (Python reference port -- layout side only; no Tensor concept)
    cutlass/python/pycute/swizzle.py:108-109
        return self.layoutB(self.offset + self.layoutA(*args))
    Same formula as CuTe C++ (just the opposite naming convention:
    pycute's layoutB == CuTe's layout_a == outer).

  - CuTe paper formal definition
    Tensor T = (Engine E, Layout L); T(coord) = E[L(coord)].

To make slicing chains preserve correctness under the new addressing
rule, slice_and_offset's bare-Layout-with-embedded-swizzle path now
mirrors what _slice_for_composition already did for the ComposedLayout
case: the slice's contribution is folded into a Form-B
ComposedLayout(Sw, sub_L, offset=delta) so the swizzle is applied to
'(delta + sub_L(coord))' inside the ComposedLayout, then the existing
affine-decay attempt is given a chance to reduce it back to a plain
Layout when the swizzle is affine on the surviving inner image. The
linear residue handed back to the Tensor is zero (or the post-decay
base offset). This matches CuTe slice_and_offset on a swizzled
ComposedLayout (cutlass/include/cute/swizzle_layout.hpp:230-262),
where the swizzle-interacting part of the slice is XORed into the
ComposedLayout's own offset and only the linear-residue part is
returned for the engine pointer.

Behavior change from a caller's point of view: only manifests when
constructing 'Tensor(swizzled_Layout, offset=k_nonzero, data=...)'
DIRECTLY. The pre-fix repo never exercised that pattern -- zero tests,
zero examples, zero notebooks construct it. Slicing chains on swizzled
Tensors continue to produce the same memory addresses as direct
indexing (verified pointwise across the existing test suite).

A UserWarning is emitted at Tensor.__init__ for the affected pattern
(Layout-with-swizzle plus nonzero external offset) to flag the change.
The suggested replacement -- Tensor(ComposedLayout(swizzle, layout,
offset=k), offset=0) -- recovers the old fold-into-domain semantic
exactly, since ComposedLayout's own offset slot DOES enter the
swizzle's input domain via outer(comp.offset + inner(coord)).

Documentation: docs/tensor_api.md updated to state the unified rule and
remove the 'pre-swizzle linear address' wording that described the old
embedded-swizzle path.

Tests: 897 passed, 2 skipped. Eight tests that pinned the pre-fix
structural representation (e.g. asserting row.offset == 24 after
slicing, asserting the sub-Layout still carried .swizzle) were rewritten
to the functional contract (sub(coord) == orig(c)). One test
(test_swizzled_tensor_full_slice_matches_explicit_full_slice) wraps its
body in warnings.catch_warnings() because it deliberately exercises the
affected pattern. Two new positive regression tests pin the new
semantic explicitly:

  - test_tensor_embedded_swizzle_offset_added_after_swizzle: asserts
    Tensor(EmbSwL, offset=k)[coord] == k + Sw(L(coord)).
  - test_tensor_embedded_swizzle_and_composed_form_agree_under_offset:
    asserts the embedded and explicit ComposedLayout(Sw, L, 0) forms
    produce the same addresses for the same Tensor offset.

The full suite was also re-run with -W error::UserWarning to confirm
nothing in the test/example surface inadvertently trips the new
warning.
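
The difference between the two addressing rules can be checked with plain integers. This sketch models `Swizzle(2, 0, 2)` under CuTe's `Swizzle<B, M, S>` convention (an assumption) and uses an identity function as a stand-in for `Layout(8, 1)`; all function names are illustrative:

```python
def sw(x):
    # Swizzle(2, 0, 2): XOR bits 2..3 into bits 0..1.
    return x ^ ((x & 0b1100) >> 2)


def layout(coord):
    # Stand-in for Layout(8, 1): the identity on [0, 8).
    return coord


def old_address(coord, offset):
    # Pre-fix embedded form: offset folded into the swizzle's input.
    return sw(offset + layout(coord))


def new_address(coord, offset):
    # Unified rule: tensor.offset + tensor.layout(coord).
    return offset + sw(layout(coord))


def composed_address(coord, comp_offset, tensor_offset=0):
    # ComposedLayout(Sw, L, offset=k) under a Tensor offset.
    return tensor_offset + sw(comp_offset + layout(coord))


print(old_address(0, 4), new_address(0, 4))  # 5 4
print(all(composed_address(c, 4) == old_address(c, 4) for c in range(8)))  # True
```

The last line shows why the suggested replacement recovers the old semantic: the ComposedLayout's own offset still enters the swizzle's input domain.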
After the CuTe-aligned addressing fix (commit c19e378), the Tensor's
external offset is added linearly AFTER the layout call for both the
embedded-swizzle Layout and the explicit ComposedLayout(Sw, L, 0)
forms. The fast-path bound

    bounds = (offset, offset + cosize(layout) - 1)

is therefore correct for ANY offset on either form -- the previous gate
that restricted the embedded form to offset == 0 is no longer needed.

Simplifies the precondition from two to one (inner image starts at 0)
and shortens the fast-path comment accordingly. Behavior unchanged for
the cases the gate previously allowed; behavior improved (fast path
instead of slow O(size) walk) for embedded-swizzle Layout with nonzero
external offset.

Tests: full suite (897 passed, 2 skipped, 0 warnings under
-W error::UserWarning). Two tests updated:

  - test_address_bounds_slow_walk_for_embedded_swizzle_with_nonzero_offset
    is renamed to ..._fast_path_for_... and pins the fast-path bound
    explicitly (lo == offset, hi == offset + cosize - 1).

  - test_address_bounds_fast_path_taken_for_canonical_swizzle (the
    whitebox poison-the-cache test) now exercises the embedded form
    with a non-zero offset too.
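
A small check of the fast-path bound against an explicit walk, assuming addresses follow `offset + Sw(L(coord))` with `L = Layout(size, 1)` and `Swizzle(2, 0, 2)` in CuTe's convention. The function names are hypothetical:

```python
def sw(x):
    # Swizzle(2, 0, 2): XOR bits 2..3 into bits 0..1.
    return x ^ ((x & 0b1100) >> 2)


def cosize(size):
    # Enumerated cosize of Sw o Layout(size, 1).
    return max(sw(i) for i in range(size)) + 1


def fast_bounds(offset, size):
    # Closed-form bound used by the fast path.
    return (offset, offset + cosize(size) - 1)


def walked_bounds(offset, size):
    # Slow path: visit every coordinate.
    addresses = [offset + sw(i) for i in range(size)]
    return (min(addresses), max(addresses))


print(fast_bounds(7, 5), walked_bounds(7, 5))    # (7, 12) (7, 12)
print(fast_bounds(3, 16), walked_bounds(3, 16))  # (3, 18) (3, 18)
```

The lower bound is exact because the inner image starts at 0 and `sw(0) == 0`; the upper bound is exact by the definition of cosize. In the library the cosize would come from the instance cache rather than a fresh enumeration.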
…le form

Pure no-op refactor of the test surface in preparation for the Path X
representation collapse:

  Layout becomes purely affine.
  ComposedLayout is the single representation for swizzled forms.

This commit only touches tests and examples; source behavior is unchanged
and the suite still passes. The next commit flips producers and the
following one removes the embedded-form carrier. This same test suite must
continue to pass without any further test-side changes, so all
assertions that pinned the embedded form ("isinstance(R, Layout) and
R.swizzle == sw") have been relaxed to representation-tolerant equivalents.

Two kinds of edits:

  * Construction-site swap: Layout(.., swizzle=Sw) literals replaced
    with compose(Sw, Layout(...)) or ComposedLayout(Sw, Layout(...)).
    Both forms have identical address semantics post-c19e378, so the
    swap is purely surface.

  * Assertion-site relaxation: "isinstance(R, Layout); R.swizzle =="
    pins replaced with "isinstance(R, (Layout, ComposedLayout))" plus
    a structural / pointwise equality check that holds for either
    representation.
…ded swizzle in-tree

After this commit no in-tree code constructs an embedded-swizzle Layout
via the algebra; ComposedLayout is the canonical representation for every
swizzled form. The `Layout(..., swizzle=...)` constructor kwarg is still
accepted (removed in the next commit) for backward compatibility while
the slot lives on the class.

Producer flips (src/tensor_layouts/layouts.py):

  * _compose_swizzle_lhs: always returns ComposedLayout(swizzle, layout_b)
    instead of decaying to Layout(.., swizzle=swizzle) for affine layout_b.
    This is the single most cascading change -- every `compose(Sw, L)`
    call site now produces a ComposedLayout.
  * _compose_layout_with_layout: drops the
    `if layout_b.swizzle is not None: return ComposedLayout(layout_a, layout_b)`
    short-circuit that previously kept compose(L, embedded_L) intact.
  * _compose_with_composed_rhs: now keeps the swizzled wrapper intact
    instead of associating compose(L, ComposedLayout(Sw, L', 0)) into
    compose(compose(L, Sw), L'). Pre-Path-X this path was hidden by the
    short-circuit above; the underlying swizzle-transfer in
    _compose_with_swizzle_rhs is not pointwise-correct for hierarchical
    affine outers. Pure-affine ComposedLayout (Layout outer) still
    associates safely.
  * compose() dispatcher: drops the LHS-Layout-with-swizzle arm at line
    3434-3435 (now unreachable from in-tree producers). The helper
    _compose_swizzled_layout_lhs is left in place; deleted in C3.
  * _forward_layout_domain: the legacy embedded-swizzle branch now always
    promotes to ComposedLayout (no more `Layout(.., swizzle=...)`
    rewrap fast path).
  * Layout.__call__ slice path, Layout.squeeze, Layout.filter,
    Layout.flatten, mode(): drop the `swizzle=self._swizzle` kwarg.
    Path X Layout is purely affine; in-tree callers never reach these
    paths with a swizzled self.
  * right_inverse, left_inverse: drop the `Layout.swizzle is not None`
    fast paths; only the ComposedLayout(Swizzle, Layout, 0) arm remains.
  * slice_and_offset, _slice_for_composition: collapse to the affine slice;
    the legacy Form-B promotion (slice contribution folded into a
    ComposedLayout(Sw, sub_L, offset=delta)) is now unreachable for bare
    Layout because no Layout is ever swizzled in-tree.
  * _try_decay_swizzle_composed: drop the redundant
    `inner.swizzle is not None` rejection.
  * logical_product: drop the `inner.swizzle is None` extra check on
    the swizzled-tile fast path; drop the
    `Layout(.., swizzle=embedded_swizzle)` reattachment in the generic
    fallback (no longer needed; _logical_product_with_swizzled_tile
    already returns the right ComposedLayout form).

Consumer updates:

  * src/tensor_layouts/tensor.py:
      - _tensor_address: legacy Layout.swizzle arm preserved as a
        compatibility branch while the kwarg is accepted; removed in C3.
      - _address_bounds: precondition on the affine fast path uses
        getattr(layout, 'swizzle', None) is None so the canonical path
        is taken even when split helpers handle the swizzled form.
      - Tensor.__getitem__: drop `swizzle=sub.swizzle` in the slice-result
        reconstruction.
      - Tensor.__init__ embedded-form warning: left in place for C2
        (warning is unreachable from in-tree code); deleted in C3.

  * src/tensor_layouts/analysis.py:
      - to_F2_matrix: legacy Layout.swizzle post-composition preserved
        with getattr; in-tree callers never hit it.
      - from_F2_matrix: now constructs ComposedLayout(sw, Layout, 0)
        instead of Layout(.., swizzle=sw).

  * src/tensor_layouts/viz.py:
      - _normalize_display_layout: drops `swizzle=layout.swizzle` kwarg.
      - _eval_layout_with_offset: always takes the affine branch on a
        bare Layout (no embedded-swizzle apply step).
      - _layout_expr_with_offset: always uses the identity-outer
        ComposedLayout to internalise the external offset.
After the previous commit, nothing in-tree produced or read embedded-swizzle
Layout; this commit removes the carrier itself and all the dispatch arms
that fed it.

src/tensor_layouts/layouts.py:
  * Layout.__init__: drop the `swizzle=` kwarg and the `self._swizzle`
    slot. Drop the `self._cached_cosize` slot (cosize is closed-form
    O(1) for affine Layout; the cache lived only to amortise the
    swizzle-aware enumeration that no longer applies to Layout).
  * Layout.swizzle property: deleted.
  * Layout.__eq__: drop the `_swizzle` term.
  * Layout.__hash__: drop the `_swizzle` hash term; reduce to
    hash((shape, stride)).
  * Layout.__repr__: collapse to the single-form
    `Layout(shape, stride)`; the eval-roundtrip is now exact.
  * Layout.__str__: drop the `(Sw) o (...)` wrapper; simple
    `shape : stride` notation.
  * Layout.__call__: drop the swizzle post-application; coordinate
    evaluation is just `crd2offset(coords, shape, stride)`.
  * cosize(): drop the embedded-swizzle Layout cache branch.
  * _strip_swizzle(): deleted (no callers).
  * _split_zero_offset_swizzle(): keep only the
    `ComposedLayout(Sw, L, 0)` arm; the Layout arm is gone.
  * _compose_swizzled_layout_lhs(): deleted (was unreachable since C2).
  * compose() dispatcher comment cleaned up.

src/tensor_layouts/tensor.py:
  * _tensor_address: collapsed to
    `return offset + crd2offset(coords, layout.shape, layout.stride)`
    on the Layout branch (no more swizzle post-application).
  * _address_bounds: simplified affine fast-path; the
    `getattr(layout, 'swizzle', None) is None` guard is gone since
    Layout has no swizzle attribute. ComposedLayout split-handling
    below is unchanged.
  * Tensor.__init__: deleted the `Tensor(swizzled Layout, offset!=0)`
    UserWarning block and the back-compat folding documentation.
    The warned-about case is no longer constructible -- Layout has no
    swizzle attribute, so `isinstance(self._layout, Layout) and
    self._layout.swizzle is not None` is structurally impossible.
  * import warnings kept as `# noqa: F401` for re-export stability.

src/tensor_layouts/analysis.py:
  * to_F2_matrix: dropped the `if layout.swizzle is not None`
    post-composition arm; bare Layout always goes through the affine
    column-build path.
Tracking the representation collapse in user-facing prose. All
behaviour changes landed in the previous two commits; this commit only
updates documentation, example narration, and one example assertion
that pinned the legacy embedded-swizzle Layout shape.

docs/layout_api.md:
  * 'Layout Expressions and ComposedLayout' section: drop the
    'Layout may also carry one canonical final swizzle' framing;
    Layout is now purely affine and ComposedLayout is the home for
    every non-affine form.
  * 'When compose() returns Layout vs ComposedLayout' section:
    the canonical Sw o L now uniformly returns ComposedLayout(Sw, L, 0)
    instead of Layout(.., swizzle=Sw). Updated the example output to
    match the new repr; bare Layout returns are documented as 'both
    operands affine' only.
  * 'Example: canonical fast path vs exact fallback' section: updated
    the type assertions to match the new ComposedLayout return.
  * compose() reference paragraph: drop the 'returns a Layout with an
    embedded swizzle' shortcut wording.

docs/tensor_api.md:
  * 'Composed Layouts' section: drop the dual-form
    `Tensor(Layout(.., swizzle=Sw))` mention; only
    `Tensor(ComposedLayout(Sw, L, k))` survives. The address rule
    `tensor(coord) == tensor.offset + tensor.layout(coord)` is unchanged.

docs/analysis_api.md:
  * to_F2_matrix example: drop the embedded-form input; show only the
    canonical ComposedLayout form.
  * from_F2_matrix description: now returns a LayoutExpr (Layout or
    ComposedLayout(Sw, Layout, 0)) instead of optionally embedded.
  * Round-trip example uses the canonical ComposedLayout form.

examples/composed.py:
  * example_fast_path renamed in spirit ('canonical swizzled form ...
    returns a ComposedLayout'); pinned to the new structure
    (assert isinstance(swizzled, ComposedLayout); swizzled.outer ==
    Swizzle(...); swizzled.inner == base).

examples/layouts.py:
  * example_swizzle docstring: 'embeds the swizzle inside the Layout'
    -> 'produces a ComposedLayout(Swizzle, Layout, 0) -- the canonical
    Path X representation'.
…te order difference

tensor-layouts uses `ComposedLayout(outer, inner, offset=k)` while CuTe
C++ and pycute place the offset positionally between the outer and inner
slots: `ComposedLayout<A, Offset, B>` / `ComposedLayout(layoutB, offset,
layoutA)`. The Python ordering is more ergonomic for the common
zero-offset canonical `Sw o L` case (the parameter can be dropped
entirely), but it creates a porting trap: someone copying a CuTe
positional literal into Python could accidentally write
`ComposedLayout(Sw, k, L)`. Today that raises a clear TypeError from the
`inner` type-check (`int is not Layout/ComposedLayout/Swizzle`), but
nothing in the constructor signature itself warns about the ordering
difference.

Promote `offset` to a keyword-only field so both shapes of the trap fail
at the call-site (positional argument count) rather than later in
`__post_init__`:

  ComposedLayout(Sw, L, 4)           # tensor-layouts positional offset -> rejected
  ComposedLayout(Sw, 4, L)           # CuTe-style positional order      -> rejected

  ComposedLayout(Sw, L)              # default offset=0  (canonical)    -> works
  ComposedLayout(Sw, L, offset=4)    # explicit non-zero offset         -> works

All in-tree call sites already use `offset=` keyword form (audited
across src/, tests/, examples/), so this is a zero-breakage hardening.

Implementation: `offset: int = field(default=0, kw_only=True)` on the
frozen dataclass (Python 3.10+ `kw_only` field option).

Docs: new 'Constructor signature vs CuTe / pycute' subsection in
docs/layout_api.md. Includes a 3-row comparison table and explicit
examples of the rejected positional shapes. The semantics section above
it now uses `offset=k` notation everywhere to match the constructor.

Tests: 2240 passed, 142 skipped (unchanged).
Summary:
``_split_zero_offset_swizzle`` was imported privately by ``tensor.py`` to
gate an O(1) fast path in ``_address_bounds``, an abstraction leak. It
also matches the structural query that ``max_common_layout`` /
``max_common_vector`` already use internally. Promote it to public:

- Renamed to ``split_outer_swizzle`` -- the prior name overpromised; it
  only recognises the canonical ``ComposedLayout(Sw, L, offset=0)``
  form, not the inverse-form ``ComposedLayout(L, Sw, offset)`` produced
  by ``right_inverse`` / ``left_inverse``. The "outer" qualifier names
  the slot the Swizzle occupies; pairs with the existing private
  predicate ``_is_swizzle_inner_composed``.
- Expanded docstring: states what is and isn't recognised, explains why
  the inverse-form is intentionally excluded (different semantics, can
  emit negative addresses), and points at where to grow a sibling
  ``split_inner_swizzle`` if a public consumer ever appears.
- Added to ``__all__`` next to the other swizzle exports.
- Dropped the private back-channel ``from .layouts import
  _split_zero_offset_swizzle`` from ``tensor.py``; the public name
  arrives via the existing ``from .layouts import *``.

Test Plan: ``make test`` -- 2240 passed, 142 skipped (unchanged).
Promote ``src/tensor_layouts/layouts.py`` (4.4k LOC) to a package with
three layered modules and an aggregating ``__init__.py``:

* ``core.py`` -- exceptions, type predicates, tuple operations, the
  affine ``Layout`` class, ``Tile``, and the ``Swizzle`` primitive. No
  dependency on ``ComposedLayout``; the import graph now enforces what
  used to be a convention.
* ``expr.py`` -- the ``LayoutExpr`` layer: ``ComposedLayout`` plus every
  predicate / coercer that operates on ``LayoutExpr = Layout |
  ComposedLayout`` (``is_layout``, ``is_affine``, ``as_layout``,
  ``as_layout_expr``, ``as_affine_layout``, ``split_outer_swizzle``,
  ``_forward_layout_domain``).
* ``algebra.py`` -- the CuTe layout algebra (compose, complement,
  divide, product, inverses, coalesce, idx2crd, upcast/downcast, ...).

Dependency direction is strictly ``core <- expr <- algebra``.

The package's ``__all__`` is the union of the three submodules' own
``__all__`` lists (rather than a hand-curated 89-name copy that drifts
in practice -- this fix surfaces ``coords_all_none``, which the curated
list had silently dropped). Private symbols still consumed by other
in-package modules (``_NO_FORWARD``, ``_forward_layout_domain``, ...)
are explicitly re-exported from ``__init__.py``.

No public API change: every name previously importable from
``tensor_layouts.layouts`` remains importable from the same path. All
2350 tests still pass.
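
The ``__all__`` aggregation can be sketched roughly as follows. The
modules here are in-memory stand-ins built with `types.ModuleType`, the
helper names `make_module` and `aggregate_all` are hypothetical, and the
placement of each symbol in a given module is arbitrary; the real
``__init__.py`` may be written differently:

```python
import types


def make_module(name, exported, private=()):
    """Build a stand-in module with the given public and private names."""
    mod = types.ModuleType(name)
    for attr in (*exported, *private):
        setattr(mod, attr, f"{name}.{attr}")  # placeholder objects
    mod.__all__ = list(exported)
    return mod


# Symbol placement below is illustrative only.
core = make_module("core", ["Layout", "Swizzle", "coords_all_none"],
                   private=["_NO_FORWARD"])
expr = make_module("expr", ["ComposedLayout", "split_outer_swizzle"],
                   private=["_forward_layout_domain"])
algebra = make_module("algebra", ["compose", "complement"])


def aggregate_all(*modules):
    """Union the submodules' __all__ lists, preserving first-seen order."""
    names = []
    for mod in modules:
        for name in mod.__all__:
            if name not in names:
                names.append(name)
    return names


# Derived from the submodules, so a newly exported name cannot be
# forgotten the way a hand-curated list allows.
package_all = aggregate_all(core, expr, algebra)
```

Private names never enter the derived list, which is why the ones still
consumed elsewhere in the package need an explicit re-export.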
@meta-cla (bot) added the CLA Signed label May 15, 2026
@jduprat jduprat merged commit 75ec7ad into facebookresearch:main May 15, 2026
9 checks passed