
Commit 5571bc5

committed
diagrams
1 parent b78ae04 commit 5571bc5

15 files changed

Lines changed: 128 additions & 7 deletions

ARCHITECTURE.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -66,8 +66,8 @@ in `docs/design/limb-design.md`, `docs/design/bigint-design.md`, and `docs/desig
 
 ## Visual guides
 
-- `docs/diagrams/architecture-stack-mermaid.md`: module layering from application to helper utilities.
-- `docs/diagrams/build-flow-mermaid.md`: configure/build/test/bindings workflow so new contributors can run the project end-to-end.
+- `docs/diagrams/core-architecture.mermaid.md`: layered helpers from `limb` → `bigint` → the umbrella helpers and GEMM stack.
+- [`docs/diagrams/docs-sitemap.mermaid.md`](diagrams/docs-sitemap.mermaid.md): site map summarizing the docs portal and related resources.
 
 This guide is intentionally light and developer-facing—if you need a runnable overview, `docs/index.md`
 acts as the higher-level docs portal introduced in the README.
```

BENCHMARKS.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -43,6 +43,10 @@ Each row contains:
 
 Use this CSV to plot accuracy vs. storage or compare latency across the three modes.
 
+## Diagrams
+
+View the [benchmark comparison diagram](docs/diagrams/benchmarks.mermaid.md) for a quick latency/storage summary that highlights the 15–22× wins.
+
 ## Results sharing
 
 When opening a pull request, add the latest benchmark rows (or a summary table) to this file or reference the CSV as part of your performance discussion so reviewers can reproduce the numbers.
```

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -102,7 +102,7 @@ target_link_libraries(... t81::t81lib)
 
 ## GPU backends
 
-Optional CUDA/ROCm backends can be enabled with `-DUSE_CUDA=ON` / `-DUSE_ROCM=ON` so the Python bindings link against the GPU kernels. `t81lib` exposes a compact `TensorMetadata` ABI that carries device, dtype, shape, and stride info, allowing `where`, `clamp`, `lerp`, and `addcmul` to work directly on NumPy arrays or Torch tensors. See [docs/gpu.md](docs/gpu.md) and [docs/torch.md](docs/torch.md) for build flags, device routing, supported ops, and lifetime details.
+Optional CUDA/ROCm backends can be enabled with `-DUSE_CUDA=ON` / `-DUSE_ROCM=ON` so the Python bindings link against the GPU kernels. `t81lib` exposes a compact `TensorMetadata` ABI that carries device, dtype, shape, and stride info, allowing `where`, `clamp`, `lerp`, and `addcmul` to work directly on NumPy arrays or Torch tensors. See [docs/gpu.md](docs/gpu.md), [docs/torch.md](docs/torch.md), and the [GPU dispatch diagram](docs/diagrams/gpu-dispatch.mermaid.md) for build flags, device routing, supported ops, and lifetime details.
 
 ## CLI helpers
 
```

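The `TensorMetadata` ABI described in the GPU backends hunk is easy to picture as a small record. A minimal Python sketch, where the field names and the contiguity helper are illustrative assumptions rather than the actual C++ ABI:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TensorMetadata:
    """Hypothetical mirror of the compact ABI: device, dtype, shape, strides."""
    device: str                # e.g. "cpu", "cuda:0", "rocm:0"
    dtype: str                 # e.g. "float32", "ternary"
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]   # element strides, not byte strides

    def is_contiguous(self) -> bool:
        # Row-major check: innermost stride is 1, each outer stride equals
        # the product of the sizes of all inner dimensions.
        expected = 1
        for size, stride in zip(reversed(self.shape), reversed(self.strides)):
            if stride != expected:
                return False
            expected *= size
        return True

meta = TensorMetadata("cuda:0", "float32", (4, 3), (3, 1))
```

A consumer would validate such a record before dispatching, which is exactly the role the GPU dispatch diagram assigns to its validate step.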
docs/api-overview.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # API overview
 
-This page captures the high-level helpers exposed by the umbrella header so you can understand the building blocks without diving into every header.
+This page captures the high-level helpers exposed by the umbrella header so you can understand the building blocks without diving into every header. Review the [core architecture diagram](diagrams/core-architecture.mermaid.md) for an inheritance/data-flow sketch of the same helpers.
 
 ## Core numerics
```

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@

```mermaid
pie title Latency & storage comparison (relative)
    "FP32 latency" : 100
    "PTQ latency" : 45
    "QAT latency" : 38
    "FP32 storage" : 100
    "PTQ storage" : 22
    "QAT storage" : 24
```
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@

```mermaid
flowchart LR
    subgraph Core [t81::core]
        limb["limb (48 trits)"]
        bigint["bigint (limb slices)"]
    end
    subgraph HighLevel [Umbrella helpers]
        Int[t81::Int]
        Float["t81::Float / FloatN"]
        BigInt[t81::BigInt alias]
        Ratio[t81::Ratio]
        Vector[t81::Vector]
    end
    limb --> Int
    limb --> Float
    bigint --> BigInt
    BigInt --> Ratio
    Vector --> Float
    Float --> Ratio
    Vector --> Int
    subgraph Ops [Arithmetic & GEMM]
        GEMM[t81::linalg::gemm_ternary]
        Fixed["t81::Fixed&lt;N&gt;"]
    end
    Int --> Ops
    Float --> Ops
    Vector --> GEMM
    Fixed --> GEMM
    click Float "docs/api-overview.md" "See the helper summary"
```
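The `limb (48 trits)` node above packs balanced-ternary digits. As a rough illustration of the encoding only (not the library's actual packing or limb layout), a sketch that round-trips an integer through trits in {-1, 0, +1}:

```python
def to_balanced_trits(n: int, width: int = 48) -> list:
    """Encode n as balanced-ternary trits (-1, 0, +1), least significant first."""
    trits = []
    for _ in range(width):
        r = n % 3                # Python's % is non-negative for negative n too
        if r == 2:               # digit 2 becomes -1 with a carry into the next trit
            trits.append(-1)
            n = n // 3 + 1
        else:
            trits.append(r)
            n //= 3
    return trits

def from_balanced_trits(trits) -> int:
    """Decode trits (least significant first) back to an integer."""
    value = 0
    for t in reversed(trits):
        value = value * 3 + t
    return value
```

Balanced ternary makes negation trivial (flip every trit), which is one reason a ternary numerics stack layers cleanly from limbs up to `bigint`.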
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@

```mermaid
graph TD
    Docs[Docs portal]
    GettingStarted[Getting started]
    Specs[Specs & design]
    Examples[Examples & testing]
    Docs --> GettingStarted
    Docs --> Specs
    Docs --> Examples
    GettingStarted --> README
    GettingStarted --> PythonInstall
    GettingStarted --> CLI
    Specs --> Spec
    Specs --> Design
    Specs --> APIOverview
    Examples --> Demos
    Examples --> Tests
    Examples --> Benchmarks
    README[README.md]
    PythonInstall[docs/python-install.md]
    CLI[docs/references/cli-usage.md]
    Spec[docs/t81lib-spec-v1.0.0.md]
    Design[docs/design/]
    APIOverview[docs/api-overview.md]
    Demos[examples/README.md]
    Tests[tests/]
    Benchmarks[BENCHMARKS.md]
```
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@

```mermaid
flowchart LR
    torch[PyTorch tensor] --> extract[Extract metadata]
    numpy[NumPy array] --> extract
    extract --> validate[Validate device/dtype]
    validate --> dispatch[Dispatch to backend]
    dispatch --> cuda[CUDA kernel]
    dispatch --> rocm[ROCm kernel]
    dispatch --> cpu[CPU fallback]
    cuda --> wrap[Wrap GPU tensor]
    rocm --> wrap
    cpu --> wrap
    wrap --> return[Return to caller]
    subgraph Errors
        mismatch[Device mismatch] --> error[Error path]
        unsupported[Unsupported dtype] --> error
    end
    validate --> mismatch
    validate --> unsupported
```
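The validate/dispatch branches in this flowchart amount to a lookup on device and dtype. A Python sketch of that routing, where the backend names, supported-dtype set, and error messages are illustrative assumptions, not `t81lib`'s actual API:

```python
SUPPORTED_DTYPES = {"float32", "bfloat16", "ternary"}  # assumed set for illustration

def dispatch(op: str, device: str, dtype: str) -> str:
    """Route an elementwise op to a backend, mirroring the diagram's branches."""
    if dtype not in SUPPORTED_DTYPES:
        raise TypeError(f"unsupported dtype: {dtype}")   # 'Unsupported dtype' error path
    if device.startswith("cuda"):
        return f"cuda_kernel:{op}"                       # CUDA kernel branch
    if device.startswith("rocm"):
        return f"rocm_kernel:{op}"                       # ROCm kernel branch
    if device == "cpu":
        return f"cpu_fallback:{op}"                      # CPU fallback branch
    raise ValueError(f"device mismatch: {device}")       # 'Device mismatch' error path
```

Keeping validation ahead of dispatch means every backend kernel can assume a well-formed metadata record, which is the structure the diagram encodes.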
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@

```mermaid
flowchart TB
    tryte["Packed trytes (limbs)"]
    load["Load registers (AVX/NEON)"]
    mask[Mask & expand trits]
    multiply[Multiply columns]
    accumulate[Accumulate into FP32/BF16]
    store[Store to output buffer]
    tryte --> load --> mask --> multiply --> accumulate --> store
    style load stroke:#333,stroke-width:1px
    style mask stroke:#f66,stroke-width:1px
    style multiply stroke:#36c,stroke-width:1px
```
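The mask/multiply/accumulate pipeline above works because ternary weights lie in {-1, 0, +1}, so each "multiply" is really an add, a subtract, or a skip. A scalar sketch of that inner loop, with plain lists and no SIMD or tryte packing (the real `gemm_ternary` operates on packed limbs):

```python
def gemm_ternary(a, w):
    """C = A @ W where W holds ternary weights in {-1, 0, +1}.

    Multiplication degenerates to add/subtract/skip, which is what the
    masked SIMD lanes in the diagram exploit.
    """
    rows, inner, cols = len(a), len(w), len(w[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            x = a[i][k]
            for j in range(cols):
                t = w[k][j]
                if t == 1:
                    c[i][j] += x      # accumulate
                elif t == -1:
                    c[i][j] -= x      # accumulate negated
                # t == 0: skip (the masked-out lane)
    return c
```

The vectorized version in the diagram does the same thing lane-wise: expand packed trits into masks, then use the masks to select add/subtract/skip per column before accumulating into FP32/BF16.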
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@

```mermaid
sequenceDiagram
    participant PyTorch
    participant Quantizer
    participant CLI
    participant Runtime

    PyTorch->>Quantizer: export float model
    Quantizer->>Quantizer: `t81.torch` quantizes (TernaryTensor)
    Quantizer->>CLI: pack weights, store GGUF
    CLI->>Runtime: load GGUF
    Runtime->>Runtime: run `gemm_ternary` + accumulators
    Runtime->>PyTorch: return inference results
```
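The Quantizer step in the sequence above can be approximated with threshold-based ternarization. This sketch is an assumption about a typical scheme (threshold at a fraction of the mean magnitude, one scale per tensor), not `t81.torch`'s actual implementation:

```python
def quantize_ternary(weights, threshold_ratio=0.7):
    """Map float weights to ternary values plus one per-tensor scale.

    Weights below the threshold collapse to 0; the rest keep their sign.
    The scale is the mean magnitude of the surviving weights.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_ratio * mean_abs
    ternary = [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, t in zip(weights, ternary) if t != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return ternary, scale
```

At inference time the runtime would multiply the ternary GEMM output by the stored scale, which is why the packed GGUF only needs trits plus a handful of scalars per tensor.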
