Performance optimizations and fix Poseidon's compact mode#477

Open
srinathsetty wants to merge 6 commits into main from performance

Conversation

@srinathsetty (Collaborator)

No description provided.

- Classify R1CS matrix coefficients at setup into zero/one/minus-one/other
  buckets for branch-free SpMV hot loops
- Add multiply_vec_pair_into for buffer-reuse dual-vector SpMV
- Fused dual-vector sparse matrix multiply and multi_evaluate
- Precomputed SpMV in multiply_vec and commit_T with clone elimination
- Sequential path for small sparse matrix multiply (<4097 rows)
- Sequential multiply_vec, commit_T, commit_T_relaxed for small circuits
- Add sequential MSM path for small inputs (<4096 points)
- Add parallelism threshold to bind_poly_var_top for small arrays
- Raise sequential thresholds for MSM, SpMV, and fold to avoid
  nested rayon contention when many small operations run concurrently
- Add WitnessCS::clear() and with_capacity() for reusing constraint
  system buffers across IVC steps
- Remove unused msm_large_xyzz function
- Replace decimal string conversion in nat_to_f with direct byte-based
  conversion using to_repr/from_repr
- Optimize fits_in_bits and decompose with single to_repr call
- Add nat_to_limbs fast path for direct byte access
- Remove unused BitAccess trait
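The byte-based limb extraction described above can be sketched in a dependency-free form. The function `limbs_from_le_bytes` below is a hypothetical stand-in: it operates on plain little-endian bytes rather than the crate's `BigInt`/field types, but illustrates the fast path of reading `limb_width / 8` bytes per limb directly instead of going through decimal strings.

```rust
// Hypothetical simplified sketch of the nat_to_limbs fast path:
// extract fixed-width limbs directly from little-endian bytes.
fn limbs_from_le_bytes(bytes: &[u8], limb_width: usize, n_limbs: usize) -> Vec<u64> {
    assert!(limb_width % 8 == 0 && limb_width <= 64);
    let bytes_per_limb = limb_width / 8;
    (0..n_limbs)
        .map(|i| {
            let start = i * bytes_per_limb;
            let mut limb = 0u64;
            // Read up to bytes_per_limb bytes; missing high bytes are zero.
            for j in 0..bytes_per_limb {
                if let Some(&b) = bytes.get(start + j) {
                    limb |= (b as u64) << (8 * j);
                }
            }
            limb
        })
        .collect()
}

fn main() {
    // 0x0102030405 as little-endian bytes, split into 16-bit limbs.
    let limbs = limbs_from_le_bytes(&[0x05, 0x04, 0x03, 0x02, 0x01], 16, 3);
    assert_eq!(limbs, vec![0x0405, 0x0203, 0x0001]);
}
```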
In compact mode, ensure_allocated was a no-op for Elt::Allocated but
allocated a new variable for Elt::Num. Whether an element is Num vs
Allocated depends on the MDS matrix multiplication results, which vary
by input data. This caused different variable counts for different
inputs, breaking IVC schemes that require a fixed R1CS shape.

Fix: always allocate a fresh variable and enforce equality, regardless
of the Elt variant. This guarantees consistent R1CS structure.
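A minimal model of why this matters (the `Cs`, `Elt`, and `ensure_allocated` variants below are simplified stand-ins, not the crate's actual types): when allocation is conditional on the variant, the variable count becomes input-dependent; allocating unconditionally fixes the count.

```rust
// Simplified stand-ins for Elt::Num / Elt::Allocated and a constraint system
// that only counts allocations, to show the shape-stability issue.
enum Elt { Num(u64), Allocated(usize) }

struct Cs { num_vars: usize }

impl Cs {
    fn alloc(&mut self) -> usize {
        let v = self.num_vars;
        self.num_vars += 1;
        v
    }

    // Old behavior: only Num triggers an allocation, so the total variable
    // count depends on which variant each element happens to be.
    fn ensure_allocated_old(&mut self, e: &Elt) -> usize {
        match e {
            Elt::Allocated(v) => *v,     // no-op: no new variable
            Elt::Num(_) => self.alloc(), // allocates only here
        }
    }

    // Fixed behavior: always allocate a fresh variable (the real gadget also
    // enforces equality with the input), so the count is input-independent.
    fn ensure_allocated_new(&mut self, _e: &Elt) -> usize {
        self.alloc()
    }
}

fn count(elts: &[Elt], f: fn(&mut Cs, &Elt) -> usize) -> usize {
    let mut cs = Cs { num_vars: 0 };
    for e in elts {
        f(&mut cs, e);
    }
    cs.num_vars
}

fn main() {
    let input_a = [Elt::Num(1), Elt::Num(2)];
    let input_b = [Elt::Allocated(0), Elt::Num(2)];
    // Old: counts differ by input, so the R1CS shape is unstable.
    assert_ne!(count(&input_a, Cs::ensure_allocated_old),
               count(&input_b, Cs::ensure_allocated_old));
    // New: counts match regardless of input.
    assert_eq!(count(&input_a, Cs::ensure_allocated_new),
               count(&input_b, Cs::ensure_allocated_new));
}
```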
Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces several performance-oriented optimizations across Spartan/R1CS evaluation, MSM, and nonnative gadget utilities, largely by reducing Rayon overhead for small workloads and by adding a precomputed sparse-matrix SpMV accelerator intended to reduce field multiplication cost during R1CS matrix-vector products.

Changes:

  • Add PrecomputedSparseMatrix and integrate lazy/eager precomputation into R1CSShape, plus new dual-vector SpMV APIs.
  • Optimize multilinear polynomial multi-evaluation by fusing reductions, and add sequential fallbacks for small sizes to avoid Rayon overhead.
  • Reduce repeated to_repr() calls in nonnative bit decomposition and add witness CS capacity/clearing helpers; adjust Poseidon Elt::ensure_allocated to stabilize R1CS variable counts.
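The lazy-precomputation pattern mentioned above can be sketched with `std::sync::OnceLock`. Here `Shape` and its doubled table are hypothetical stand-ins for `R1CSShape` and `PrecomputedSparseMatrix`; the point is that `get_or_init` builds the table at most once and callers never need an explicit precompute step.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in: the real R1CSShape lazily builds a
// PrecomputedSparseMatrix per matrix; a doubled Vec<u64> stands in here.
struct Shape {
    data: Vec<u64>,
    precomputed: OnceLock<Vec<u64>>, // built lazily on first use
}

impl Shape {
    fn new(data: Vec<u64>) -> Self {
        Self { data, precomputed: OnceLock::new() }
    }

    // get_or_init runs the builder at most once, even under concurrent
    // access; later calls return the cached table.
    fn precomputed(&self) -> &[u64] {
        self.precomputed
            .get_or_init(|| self.data.iter().map(|x| x * 2).collect())
    }
}

fn main() {
    let s = Shape::new(vec![1, 2, 3]);
    assert_eq!(s.precomputed().to_vec(), vec![2, 4, 6]);
    // Second call returns the same cached buffer, not a rebuilt one.
    assert_eq!(s.precomputed().as_ptr(), s.precomputed().as_ptr());
}
```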

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 13 comments.

Show a summary per file:

  • src/spartan/polys/multilinear.rs: Adds a small-size sequential path and fuses multi-evaluation reductions for cache efficiency.
  • src/r1cs/sparse.rs: Introduces the PrecomputedSparseMatrix SpMV accelerator and adds dual-vector SpMV helpers plus sequential thresholds.
  • src/r1cs/mod.rs: Exposes PrecomputedSparseMatrix, adds lazy/eager precompute storage in R1CSShape, and adds new pairwise multiply APIs.
  • src/provider/msm.rs: Adds a sequential MSM path for medium-sized inputs to reduce Rayon contention.
  • src/gadgets/nonnative/util.rs: Optimizes bit extraction by caching to_repr() bytes; changes the nat_to_f conversion implementation.
  • src/gadgets/nonnative/mod.rs: Removes the now-unused BitAccess trait.
  • src/gadgets/nonnative/bignat.rs: Adds a byte-based fast path for limb extraction.
  • src/frontend/util_cs/witness_cs.rs: Adds with_capacity and clear to reuse witness allocations.
  • src/frontend/gadgets/poseidon/circuit2.rs: Changes ensure_allocated to always allocate a fresh variable for shape stability.
  • Cargo.toml: Bumps the crate version.


Comment on lines +24 to +27

    row_unit_pos: Vec<(u32, u16)>,
    row_unit_neg: Vec<(u32, u16)>,
    row_small: Vec<(u32, u16)>,
    row_general: Vec<(u32, u16)>,
Comment on lines +69 to +80

    if val == one {
        unit_pos_cols.push(col as u32);
    } else if val == neg_one {
        unit_neg_cols.push(col as u32);
    } else if let Some(k) = small_pos.iter().position(|&v| v == val) {
        small_cols.push(col as u32);
        small_coeffs.push((k as i8) + 2);
    } else if let Some(k) = small_neg.iter().position(|&v| v == val) {
        small_cols.push(col as u32);
        small_coeffs.push(-((k as i8) + 2));
    } else {
        general_cols.push(col as u32);
Comment on lines +85 to +88

    row_unit_pos.push((up_start, (unit_pos_cols.len() as u32 - up_start) as u16));
    row_unit_neg.push((un_start, (unit_neg_cols.len() as u32 - un_start) as u16));
    row_small.push((sm_start, (small_cols.len() as u32 - sm_start) as u16));
    row_general.push((g_start, (general_cols.len() as u32 - g_start) as u16));
Comment on lines +340 to +359

    use rayon::prelude::*;
    // Split output into chunks that can be written independently
    let results: Vec<(F, F)> = self
        .indptr
        .par_windows(2)
        .map(|ptrs| {
            let mut s1 = F::ZERO;
            let mut s2 = F::ZERO;
            for (val, col_idx) in self.get_row_unchecked(ptrs.try_into().unwrap()) {
                s1 += *val * v1[*col_idx];
                s2 += *val * v2[*col_idx];
            }
            (s1, s2)
        })
        .collect();
    for (row_idx, (s1, s2)) in results.into_iter().enumerate() {
        out1[row_idx] = s1;
        out2[row_idx] = s2;
    }
}
Comment on lines +469 to +484

    if z1.len() != self.num_io + self.num_vars + 1
        || z2.len() != self.num_io + self.num_vars + 1
    {
        return Err(NovaError::InvalidWitnessLength);
    }

    rayon::join(
        || self.A.multiply_vec_pair_into(z1, z2, az1, az2),
        || {
            rayon::join(
                || self.B.multiply_vec_pair_into(z1, z2, bz1, bz2),
                || self.C.multiply_vec_pair_into(z1, z2, cz1, cz2),
            )
        },
    );

Comment on lines +105 to +120

    // Pre-compute all bit values from the field element's byte representation
    // to avoid calling to_repr() per bit (which does Montgomery reduction each time).
    let bit_values: Option<Vec<bool>> = v.map(|val| {
        let repr = val.to_repr();
        let bytes = repr.as_ref();
        (0..n_bits)
            .map(|i| {
                let (byte_pos, bit_pos) = (i / 8, i % 8);
                if byte_pos < bytes.len() {
                    (bytes[byte_pos] >> bit_pos) & 1 == 1
                } else {
                    false
                }
            })
            .collect()
    });
Comment on lines +43 to +52

    if nat.bits() as usize > n_limbs * limb_width {
        return Err(SynthesisError::Unsatisfiable(format!(
            "nat {nat} does not fit in {n_limbs} limbs of width {limb_width}"
        )));
    }
    let bytes_per_limb = limb_width / 8;
    if limb_width % 8 == 0 && bytes_per_limb <= 8 {
        // Fast path: extract limbs directly from bytes
        let (_, bytes) = nat.to_bytes_le();
        Ok(
Comment on lines +322 to +326

    debug_assert_eq!(self.cols, v1.len());
    debug_assert_eq!(self.cols, v2.len());
    let n = self.indptr.len() - 1;
    debug_assert!(out1.len() >= n);
    debug_assert!(out2.len() >= n);
Comment on lines +40 to +105

    impl<F: PrimeField> PrecomputedSparseMatrix<F> {
        /// Build from a CSR SparseMatrix by classifying entries.
        pub fn from_sparse(m: &SparseMatrix<F>) -> Self {
            let num_rows = m.indptr.len() - 1;
            let one = F::ONE;
            let neg_one = -F::ONE;

            // Precompute small positive/negative field elements for comparison
            let small_pos: Vec<F> = (2u64..=7).map(F::from).collect();
            let small_neg: Vec<F> = (2u64..=7).map(|k| -F::from(k)).collect();

            let mut row_unit_pos = Vec::with_capacity(num_rows);
            let mut row_unit_neg = Vec::with_capacity(num_rows);
            let mut row_small = Vec::with_capacity(num_rows);
            let mut row_general = Vec::with_capacity(num_rows);
            let mut unit_pos_cols = Vec::new();
            let mut unit_neg_cols = Vec::new();
            let mut small_cols = Vec::new();
            let mut small_coeffs: Vec<i8> = Vec::new();
            let mut general_cols = Vec::new();
            let mut general_vals = Vec::new();

            for ptrs in m.indptr.windows(2) {
                let up_start = unit_pos_cols.len() as u32;
                let un_start = unit_neg_cols.len() as u32;
                let sm_start = small_cols.len() as u32;
                let g_start = general_cols.len() as u32;

                for (&val, &col) in m.data[ptrs[0]..ptrs[1]].iter().zip(&m.indices[ptrs[0]..ptrs[1]]) {
                    if val == one {
                        unit_pos_cols.push(col as u32);
                    } else if val == neg_one {
                        unit_neg_cols.push(col as u32);
                    } else if let Some(k) = small_pos.iter().position(|&v| v == val) {
                        small_cols.push(col as u32);
                        small_coeffs.push((k as i8) + 2);
                    } else if let Some(k) = small_neg.iter().position(|&v| v == val) {
                        small_cols.push(col as u32);
                        small_coeffs.push(-((k as i8) + 2));
                    } else {
                        general_cols.push(col as u32);
                        general_vals.push(val);
                    }
                }

                row_unit_pos.push((up_start, (unit_pos_cols.len() as u32 - up_start) as u16));
                row_unit_neg.push((un_start, (unit_neg_cols.len() as u32 - un_start) as u16));
                row_small.push((sm_start, (small_cols.len() as u32 - sm_start) as u16));
                row_general.push((g_start, (general_cols.len() as u32 - g_start) as u16));
            }

            Self {
                num_rows,
                row_unit_pos,
                row_unit_neg,
                row_small,
                row_general,
                unit_pos_cols,
                unit_neg_cols,
                small_cols,
                small_coeffs,
                general_cols,
                general_vals,
            }
        }
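The apply side of this classification can be sketched as follows, assuming `i64` stands in for the field type; the hypothetical `row_dot` mirrors the bucket names above. Unit entries need no multiplications at all, small coefficients use a cheap small-scalar multiply, and only the general bucket pays for a full multiplication.

```rust
// Sketch of applying one classified row to a vector (i64 stands in for F).
fn row_dot(
    v: &[i64],
    unit_pos: &[u32],     // columns with coefficient +1
    unit_neg: &[u32],     // columns with coefficient -1
    small_cols: &[u32],   // columns with coefficient in ±2..=7
    small_coeffs: &[i8],
    general_cols: &[u32], // everything else
    general_vals: &[i64],
) -> i64 {
    let mut acc = 0i64;
    // +1 entries: pure additions, no multiplication.
    for &c in unit_pos {
        acc += v[c as usize];
    }
    // -1 entries: pure subtractions.
    for &c in unit_neg {
        acc -= v[c as usize];
    }
    // Small coefficients: cheap small-scalar multiply.
    for (&c, &k) in small_cols.iter().zip(small_coeffs) {
        acc += (k as i64) * v[c as usize];
    }
    // General entries: full multiply.
    for (&c, &val) in general_cols.iter().zip(general_vals) {
        acc += val * v[c as usize];
    }
    acc
}

fn main() {
    // Row (+1, -1, 3) applied to v = (10, 20, 30): 10 - 20 + 3*30 = 80.
    let r = row_dot(&[10, 20, 30], &[0], &[1], &[2], &[3], &[], &[]);
    assert_eq!(r, 80);
}
```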
Comment on lines +586 to +597

    // Build Z = Z1 + Z2 in one pass without allocating Z1/Z2 separately
    let n_w = W1.W.len();
    let n_x = U1.X.len();
    let z_len = n_w + 1 + n_x;
    let mut Z = Vec::with_capacity(z_len);
    for i in 0..n_w {
        Z.push(W1.W[i] + W2.W[i]);
    }
    Z.push(U1.u + E::Scalar::ONE);
    for i in 0..n_x {
        Z.push(U1.X[i] + U2.X[i]);
    }
- Restore par_iter with size threshold (>65536) in RelaxedR1CSWitness::fold
  and fold_relaxed to avoid regression on large witnesses (918K elements)
- Widen PrecomputedSparseMatrix row span counts from u16 to u32
- Upgrade debug_assert to assert in multiply_vec_pair_into
- Eliminate intermediate Vec allocation in parallel multiply_vec_pair_into
  by writing directly into output slices via par_iter_mut
- Add PrecomputedSparseMatrix::multiply_vec_pair_into for _into API
- Route multiply_vec and multiply_vec_pair_into through precomputed tables
  lazily (get_or_init) instead of requiring explicit precompute_sparse_matrices
- Add length validation in commit_T for mismatched W1/W2 and U1/U2
- Reject negative BigInt in nat_to_limbs fast path
- Add n_bits bounds check in fits_in_bits against Scalar::NUM_BITS
- Use flat buffer in multi_evaluate_with to avoid per-row Vec allocations
- Document ensure_allocated requires enforce=true for soundness
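The buffer-reuse `_into` pattern from the list above can be sketched over a plain CSR of `i64` (the real code uses field elements and walks rows with rayon's `par_iter_mut`; this dependency-free version is sequential). One pass over the rows computes both matrix-vector products and writes each row's results straight into the caller's buffers, with no intermediate Vec.

```rust
// Hypothetical simplified CSR matrix over i64 illustrating the
// dual-vector, write-into-caller-buffers SpMV pattern.
struct Csr {
    indptr: Vec<usize>,
    indices: Vec<usize>,
    data: Vec<i64>,
}

impl Csr {
    fn multiply_vec_pair_into(
        &self,
        v1: &[i64],
        v2: &[i64],
        out1: &mut [i64],
        out2: &mut [i64],
    ) {
        // Each row's pair of dot products lands directly in out1/out2.
        for (row, (o1, o2)) in out1.iter_mut().zip(out2.iter_mut()).enumerate() {
            let (mut s1, mut s2) = (0, 0);
            for j in self.indptr[row]..self.indptr[row + 1] {
                let (c, val) = (self.indices[j], self.data[j]);
                s1 += val * v1[c];
                s2 += val * v2[c];
            }
            *o1 = s1;
            *o2 = s2;
        }
    }
}

fn main() {
    // 2x2 matrix [[1, 2], [0, 3]] in CSR form.
    let m = Csr { indptr: vec![0, 2, 3], indices: vec![0, 1, 1], data: vec![1, 2, 3] };
    let (mut out1, mut out2) = (vec![0i64; 2], vec![0i64; 2]);
    m.multiply_vec_pair_into(&[1, 1], &[2, 2], &mut out1, &mut out2);
    assert_eq!(out1, vec![3, 3]);
    assert_eq!(out2, vec![6, 6]);
}
```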