Performance optimizations and fix Poseidon's compact mode#477

Open
srinathsetty wants to merge 6 commits into main from performance

Conversation

@srinathsetty (Collaborator)

No description provided.

- Classify R1CS matrix coefficients at setup into zero/one/minus-one/other
  buckets for branch-free SpMV hot loops
- Add multiply_vec_pair_into for buffer-reuse dual-vector SpMV
- Fused dual-vector sparse matrix multiply and multi_evaluate
- Precomputed SpMV in multiply_vec and commit_T with clone elimination
- Sequential path for small sparse matrix multiply (<4097 rows)
- Sequential multiply_vec, commit_T, commit_T_relaxed for small circuits
- Add sequential MSM path for small inputs (<4096 points)
- Add parallelism threshold to bind_poly_var_top for small arrays
- Raise sequential thresholds for MSM, SpMV, and fold to avoid
  nested rayon contention when many small operations run concurrently
- Add WitnessCS::clear() and with_capacity() for reusing constraint
  system buffers across IVC steps
- Remove unused msm_large_xyzz function
- Replace decimal string conversion in nat_to_f with direct byte-based
  conversion using to_repr/from_repr
- Optimize fits_in_bits and decompose with single to_repr call
- Add nat_to_limbs fast path for direct byte access
- Remove unused BitAccess trait
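The byte-based limb extraction described above can be sketched in a dependency-free form. The function `limbs_from_le_bytes` below is a hypothetical stand-in: it operates on plain little-endian bytes rather than the crate's `BigInt`/field types, but illustrates the fast path of reading `limb_width / 8` bytes per limb directly instead of going through decimal strings.

```rust
// Hypothetical simplified sketch of the nat_to_limbs fast path:
// extract fixed-width limbs directly from little-endian bytes.
fn limbs_from_le_bytes(bytes: &[u8], limb_width: usize, n_limbs: usize) -> Vec<u64> {
    assert!(limb_width % 8 == 0 && limb_width <= 64);
    let bytes_per_limb = limb_width / 8;
    (0..n_limbs)
        .map(|i| {
            let start = i * bytes_per_limb;
            let mut limb = 0u64;
            // Read up to bytes_per_limb bytes; missing high bytes are zero.
            for j in 0..bytes_per_limb {
                if let Some(&b) = bytes.get(start + j) {
                    limb |= (b as u64) << (8 * j);
                }
            }
            limb
        })
        .collect()
}

fn main() {
    // 0x0102030405 as little-endian bytes, split into 16-bit limbs.
    let limbs = limbs_from_le_bytes(&[0x05, 0x04, 0x03, 0x02, 0x01], 16, 3);
    assert_eq!(limbs, vec![0x0405, 0x0203, 0x0001]);
}
```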
In compact mode, ensure_allocated was a no-op for Elt::Allocated but
allocated a new variable for Elt::Num. Whether an element is Num vs
Allocated depends on the MDS matrix multiplication results, which vary
by input data. This caused different variable counts for different
inputs, breaking IVC schemes that require a fixed R1CS shape.

Fix: always allocate a fresh variable and enforce equality, regardless
of the Elt variant. This guarantees consistent R1CS structure.
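A minimal model of why this matters (the `Cs`, `Elt`, and `ensure_allocated` variants below are simplified stand-ins, not the crate's actual types): when allocation is conditional on the variant, the variable count becomes input-dependent; allocating unconditionally fixes the count.

```rust
// Simplified stand-ins for Elt::Num / Elt::Allocated and a constraint system
// that only counts allocations, to show the shape-stability issue.
enum Elt { Num(u64), Allocated(usize) }

struct Cs { num_vars: usize }

impl Cs {
    fn alloc(&mut self) -> usize {
        let v = self.num_vars;
        self.num_vars += 1;
        v
    }

    // Old behavior: only Num triggers an allocation, so the total variable
    // count depends on which variant each element happens to be.
    fn ensure_allocated_old(&mut self, e: &Elt) -> usize {
        match e {
            Elt::Allocated(v) => *v,     // no-op: no new variable
            Elt::Num(_) => self.alloc(), // allocates only here
        }
    }

    // Fixed behavior: always allocate a fresh variable (the real gadget also
    // enforces equality with the input), so the count is input-independent.
    fn ensure_allocated_new(&mut self, _e: &Elt) -> usize {
        self.alloc()
    }
}

fn count(elts: &[Elt], f: fn(&mut Cs, &Elt) -> usize) -> usize {
    let mut cs = Cs { num_vars: 0 };
    for e in elts {
        f(&mut cs, e);
    }
    cs.num_vars
}

fn main() {
    let input_a = [Elt::Num(1), Elt::Num(2)];
    let input_b = [Elt::Allocated(0), Elt::Num(2)];
    // Old: counts differ by input, so the R1CS shape is unstable.
    assert_ne!(count(&input_a, Cs::ensure_allocated_old),
               count(&input_b, Cs::ensure_allocated_old));
    // New: counts match regardless of input.
    assert_eq!(count(&input_a, Cs::ensure_allocated_new),
               count(&input_b, Cs::ensure_allocated_new));
}
```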
Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces several performance-oriented optimizations across Spartan/R1CS evaluation, MSM, and nonnative gadget utilities, largely by reducing Rayon overhead for small workloads and by adding a precomputed sparse-matrix SpMV accelerator intended to reduce field multiplication cost during R1CS matrix-vector products.

Changes:

  • Add PrecomputedSparseMatrix and integrate lazy/eager precomputation into R1CSShape, plus new dual-vector SpMV APIs.
  • Optimize multilinear polynomial multi-evaluation by fusing reductions, and add sequential fallbacks for small sizes to avoid Rayon overhead.
  • Reduce repeated to_repr() calls in nonnative bit decomposition and add witness CS capacity/clearing helpers; adjust Poseidon Elt::ensure_allocated to stabilize R1CS variable counts.
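The lazy-precomputation pattern mentioned above can be sketched with `std::sync::OnceLock`. Here `Shape` and its doubled table are hypothetical stand-ins for `R1CSShape` and `PrecomputedSparseMatrix`; the point is that `get_or_init` builds the table at most once and callers never need an explicit precompute step.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in: the real R1CSShape lazily builds a
// PrecomputedSparseMatrix per matrix; a doubled Vec<u64> stands in here.
struct Shape {
    data: Vec<u64>,
    precomputed: OnceLock<Vec<u64>>, // built lazily on first use
}

impl Shape {
    fn new(data: Vec<u64>) -> Self {
        Self { data, precomputed: OnceLock::new() }
    }

    // get_or_init runs the builder at most once, even under concurrent
    // access; later calls return the cached table.
    fn precomputed(&self) -> &[u64] {
        self.precomputed
            .get_or_init(|| self.data.iter().map(|x| x * 2).collect())
    }
}

fn main() {
    let s = Shape::new(vec![1, 2, 3]);
    assert_eq!(s.precomputed().to_vec(), vec![2, 4, 6]);
    // Second call returns the same cached buffer, not a rebuilt one.
    assert_eq!(s.precomputed().as_ptr(), s.precomputed().as_ptr());
}
```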

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 13 comments.

Show a summary per file:

  • src/spartan/polys/multilinear.rs: Adds a small-size sequential path and fuses multi-evaluation reductions for cache efficiency.
  • src/r1cs/sparse.rs: Introduces the PrecomputedSparseMatrix SpMV accelerator and adds dual-vector SpMV helpers plus sequential thresholds.
  • src/r1cs/mod.rs: Exposes PrecomputedSparseMatrix, adds lazy/eager precompute storage in R1CSShape, and adds new pairwise multiply APIs.
  • src/provider/msm.rs: Adds a sequential MSM path for medium-sized inputs to reduce Rayon contention.
  • src/gadgets/nonnative/util.rs: Optimizes bit extraction by caching to_repr() bytes; changes the nat_to_f conversion implementation.
  • src/gadgets/nonnative/mod.rs: Removes the now-unused BitAccess trait.
  • src/gadgets/nonnative/bignat.rs: Adds a byte-based fast path for limb extraction.
  • src/frontend/util_cs/witness_cs.rs: Adds with_capacity and clear to reuse witness allocations.
  • src/frontend/gadgets/poseidon/circuit2.rs: Changes ensure_allocated to always allocate a fresh variable for shape stability.
  • Cargo.toml: Bumps the crate version.


Comment on lines +24 to +27

    row_unit_pos: Vec<(u32, u16)>,
    row_unit_neg: Vec<(u32, u16)>,
    row_small: Vec<(u32, u16)>,
    row_general: Vec<(u32, u16)>,
Comment on lines +69 to +80

    if val == one {
        unit_pos_cols.push(col as u32);
    } else if val == neg_one {
        unit_neg_cols.push(col as u32);
    } else if let Some(k) = small_pos.iter().position(|&v| v == val) {
        small_cols.push(col as u32);
        small_coeffs.push((k as i8) + 2);
    } else if let Some(k) = small_neg.iter().position(|&v| v == val) {
        small_cols.push(col as u32);
        small_coeffs.push(-((k as i8) + 2));
    } else {
        general_cols.push(col as u32);
Comment on lines +85 to +88

    row_unit_pos.push((up_start, (unit_pos_cols.len() as u32 - up_start) as u16));
    row_unit_neg.push((un_start, (unit_neg_cols.len() as u32 - un_start) as u16));
    row_small.push((sm_start, (small_cols.len() as u32 - sm_start) as u16));
    row_general.push((g_start, (general_cols.len() as u32 - g_start) as u16));
Comment on lines +340 to +359

    use rayon::prelude::*;
    // Split output into chunks that can be written independently
    let results: Vec<(F, F)> = self
        .indptr
        .par_windows(2)
        .map(|ptrs| {
            let mut s1 = F::ZERO;
            let mut s2 = F::ZERO;
            for (val, col_idx) in self.get_row_unchecked(ptrs.try_into().unwrap()) {
                s1 += *val * v1[*col_idx];
                s2 += *val * v2[*col_idx];
            }
            (s1, s2)
        })
        .collect();
    for (row_idx, (s1, s2)) in results.into_iter().enumerate() {
        out1[row_idx] = s1;
        out2[row_idx] = s2;
    }
}
Comment on lines +469 to +484

    if z1.len() != self.num_io + self.num_vars + 1
        || z2.len() != self.num_io + self.num_vars + 1
    {
        return Err(NovaError::InvalidWitnessLength);
    }

    rayon::join(
        || self.A.multiply_vec_pair_into(z1, z2, az1, az2),
        || {
            rayon::join(
                || self.B.multiply_vec_pair_into(z1, z2, bz1, bz2),
                || self.C.multiply_vec_pair_into(z1, z2, cz1, cz2),
            )
        },
    );

Comment on lines +105 to +120

    // Pre-compute all bit values from the field element's byte representation
    // to avoid calling to_repr() per bit (which does Montgomery reduction each time).
    let bit_values: Option<Vec<bool>> = v.map(|val| {
        let repr = val.to_repr();
        let bytes = repr.as_ref();
        (0..n_bits)
            .map(|i| {
                let (byte_pos, bit_pos) = (i / 8, i % 8);
                if byte_pos < bytes.len() {
                    (bytes[byte_pos] >> bit_pos) & 1 == 1
                } else {
                    false
                }
            })
            .collect()
    });
Comment on lines +43 to +52

    if nat.bits() as usize > n_limbs * limb_width {
        return Err(SynthesisError::Unsatisfiable(format!(
            "nat {nat} does not fit in {n_limbs} limbs of width {limb_width}"
        )));
    }
    let bytes_per_limb = limb_width / 8;
    if limb_width % 8 == 0 && bytes_per_limb <= 8 {
        // Fast path: extract limbs directly from bytes
        let (_, bytes) = nat.to_bytes_le();
        Ok(
Comment on lines +322 to +326

    debug_assert_eq!(self.cols, v1.len());
    debug_assert_eq!(self.cols, v2.len());
    let n = self.indptr.len() - 1;
    debug_assert!(out1.len() >= n);
    debug_assert!(out2.len() >= n);
Comment on lines +40 to +105

    impl<F: PrimeField> PrecomputedSparseMatrix<F> {
        /// Build from a CSR SparseMatrix by classifying entries.
        pub fn from_sparse(m: &SparseMatrix<F>) -> Self {
            let num_rows = m.indptr.len() - 1;
            let one = F::ONE;
            let neg_one = -F::ONE;

            // Precompute small positive/negative field elements for comparison
            let small_pos: Vec<F> = (2u64..=7).map(F::from).collect();
            let small_neg: Vec<F> = (2u64..=7).map(|k| -F::from(k)).collect();

            let mut row_unit_pos = Vec::with_capacity(num_rows);
            let mut row_unit_neg = Vec::with_capacity(num_rows);
            let mut row_small = Vec::with_capacity(num_rows);
            let mut row_general = Vec::with_capacity(num_rows);
            let mut unit_pos_cols = Vec::new();
            let mut unit_neg_cols = Vec::new();
            let mut small_cols = Vec::new();
            let mut small_coeffs: Vec<i8> = Vec::new();
            let mut general_cols = Vec::new();
            let mut general_vals = Vec::new();

            for ptrs in m.indptr.windows(2) {
                let up_start = unit_pos_cols.len() as u32;
                let un_start = unit_neg_cols.len() as u32;
                let sm_start = small_cols.len() as u32;
                let g_start = general_cols.len() as u32;

                for (&val, &col) in m.data[ptrs[0]..ptrs[1]].iter().zip(&m.indices[ptrs[0]..ptrs[1]]) {
                    if val == one {
                        unit_pos_cols.push(col as u32);
                    } else if val == neg_one {
                        unit_neg_cols.push(col as u32);
                    } else if let Some(k) = small_pos.iter().position(|&v| v == val) {
                        small_cols.push(col as u32);
                        small_coeffs.push((k as i8) + 2);
                    } else if let Some(k) = small_neg.iter().position(|&v| v == val) {
                        small_cols.push(col as u32);
                        small_coeffs.push(-((k as i8) + 2));
                    } else {
                        general_cols.push(col as u32);
                        general_vals.push(val);
                    }
                }

                row_unit_pos.push((up_start, (unit_pos_cols.len() as u32 - up_start) as u16));
                row_unit_neg.push((un_start, (unit_neg_cols.len() as u32 - un_start) as u16));
                row_small.push((sm_start, (small_cols.len() as u32 - sm_start) as u16));
                row_general.push((g_start, (general_cols.len() as u32 - g_start) as u16));
            }

            Self {
                num_rows,
                row_unit_pos,
                row_unit_neg,
                row_small,
                row_general,
                unit_pos_cols,
                unit_neg_cols,
                small_cols,
                small_coeffs,
                general_cols,
                general_vals,
            }
        }
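The apply side of this classification can be sketched as follows, assuming `i64` stands in for the field type; the hypothetical `row_dot` mirrors the bucket names above. Unit entries need no multiplications at all, small coefficients use a cheap small-scalar multiply, and only the general bucket pays for a full multiplication.

```rust
// Sketch of applying one classified row to a vector (i64 stands in for F).
fn row_dot(
    v: &[i64],
    unit_pos: &[u32],     // columns with coefficient +1
    unit_neg: &[u32],     // columns with coefficient -1
    small_cols: &[u32],   // columns with coefficient in ±2..=7
    small_coeffs: &[i8],
    general_cols: &[u32], // everything else
    general_vals: &[i64],
) -> i64 {
    let mut acc = 0i64;
    // +1 entries: pure additions, no multiplication.
    for &c in unit_pos {
        acc += v[c as usize];
    }
    // -1 entries: pure subtractions.
    for &c in unit_neg {
        acc -= v[c as usize];
    }
    // Small coefficients: cheap small-scalar multiply.
    for (&c, &k) in small_cols.iter().zip(small_coeffs) {
        acc += (k as i64) * v[c as usize];
    }
    // General entries: full multiply.
    for (&c, &val) in general_cols.iter().zip(general_vals) {
        acc += val * v[c as usize];
    }
    acc
}

fn main() {
    // Row (+1, -1, 3) applied to v = (10, 20, 30): 10 - 20 + 3*30 = 80.
    let r = row_dot(&[10, 20, 30], &[0], &[1], &[2], &[3], &[], &[]);
    assert_eq!(r, 80);
}
```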
Comment on lines +586 to +597

    // Build Z = Z1 + Z2 in one pass without allocating Z1/Z2 separately
    let n_w = W1.W.len();
    let n_x = U1.X.len();
    let z_len = n_w + 1 + n_x;
    let mut Z = Vec::with_capacity(z_len);
    for i in 0..n_w {
        Z.push(W1.W[i] + W2.W[i]);
    }
    Z.push(U1.u + E::Scalar::ONE);
    for i in 0..n_x {
        Z.push(U1.X[i] + U2.X[i]);
    }
- Restore par_iter with size threshold (>65536) in RelaxedR1CSWitness::fold
  and fold_relaxed to avoid regression on large witnesses (918K elements)
- Widen PrecomputedSparseMatrix row span counts from u16 to u32
- Upgrade debug_assert to assert in multiply_vec_pair_into
- Eliminate intermediate Vec allocation in parallel multiply_vec_pair_into
  by writing directly into output slices via par_iter_mut
- Add PrecomputedSparseMatrix::multiply_vec_pair_into for _into API
- Route multiply_vec and multiply_vec_pair_into through precomputed tables
  lazily (get_or_init) instead of requiring explicit precompute_sparse_matrices
- Add length validation in commit_T for mismatched W1/W2 and U1/U2
- Reject negative BigInt in nat_to_limbs fast path
- Add n_bits bounds check in fits_in_bits against Scalar::NUM_BITS
- Use flat buffer in multi_evaluate_with to avoid per-row Vec allocations
- Document ensure_allocated requires enforce=true for soundness
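The buffer-reuse `_into` pattern from the list above can be sketched over a plain CSR of `i64` (the real code uses field elements and walks rows with rayon's `par_iter_mut`; this dependency-free version is sequential). One pass over the rows computes both matrix-vector products and writes each row's results straight into the caller's buffers, with no intermediate Vec.

```rust
// Hypothetical simplified CSR matrix over i64 illustrating the
// dual-vector, write-into-caller-buffers SpMV pattern.
struct Csr {
    indptr: Vec<usize>,
    indices: Vec<usize>,
    data: Vec<i64>,
}

impl Csr {
    fn multiply_vec_pair_into(
        &self,
        v1: &[i64],
        v2: &[i64],
        out1: &mut [i64],
        out2: &mut [i64],
    ) {
        // Each row's pair of dot products lands directly in out1/out2.
        for (row, (o1, o2)) in out1.iter_mut().zip(out2.iter_mut()).enumerate() {
            let (mut s1, mut s2) = (0, 0);
            for j in self.indptr[row]..self.indptr[row + 1] {
                let (c, val) = (self.indices[j], self.data[j]);
                s1 += val * v1[c];
                s2 += val * v2[c];
            }
            *o1 = s1;
            *o2 = s2;
        }
    }
}

fn main() {
    // 2x2 matrix [[1, 2], [0, 3]] in CSR form.
    let m = Csr { indptr: vec![0, 2, 3], indices: vec![0, 1, 1], data: vec![1, 2, 3] };
    let (mut out1, mut out2) = (vec![0i64; 2], vec![0i64; 2]);
    m.multiply_vec_pair_into(&[1, 1], &[2, 2], &mut out1, &mut out2);
    assert_eq!(out1, vec![3, 3]);
    assert_eq!(out2, vec![6, 6]);
}
```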