treewide: Add NoC collective operations + MLSys experiments#99

Merged
fischeti merged 104 commits into main from feature/reduction
Mar 26, 2026

Conversation

@colluca

@colluca colluca commented Dec 17, 2025

This PR includes all the developments made for the MLSys paper: 1) the hardware extensions required to support (performant) multicast and reduction, 2) software benchmarks and tests, and 3) an experiment framework for Picobello, derived from Snitch's, used to develop the experiments for the paper.

In detail:

  • Bump Snitch w/ support for reduction and DCA.
  • Bump AXI to enable rerouting all collective communications outside the cluster.
  • Bump FlooNoC w/ support for VCs, reductions and DCA.
  • Bump common_cells, iDMA and LLVM toolchain to align with new versions in previous IPs.
  • Update Snitch's configuration, with clearer fields for enabling narrow and wide collectives and setting the AWUSER width.
  • Pass cluster base offset to Snitch, needed internally to calculate end address of its address space.
  • Add path to experiments in PYTHONPATH, alternative to creating a proper Picobello Python package.
  • Install Snitch Python package in editable mode (useful when Snitch is bender-cloned and you are making modifications to it, and because why not?).
  • Simplify build of Picobello's software tests by reusing Snitch's Make rules.
  • Enable running simulations in directories other than PB_ROOT (this is IMO the best method for ensuring all simulation artifacts are collected under a different directory).
  • Create a proper Picobello accelerator runtime/library. src/ contains potentially reusable (across 2D tile-based accelerators) sources, impl/ contains a Picobello-specific implementation of the runtime/library (providing a Picobello-specific HAL and stitching together a Picobello-specific selection of the reusable sources, including Snitch's).
  • Implement a communicator-based (inspired by MPI) collective communication API (sync) for 2D tile-based accelerators.
  • Extend the team API.
  • Add barrier_benchmark.c, reduction_benchmark.c and dma_multicast_v2.c benchmarks.
  • Test that multiple outstanding barriers with overlapping sets of participating clusters function correctly (overlapping_barriers.c). As expected, this test failed before this feature was supported.
  • Stress test row, column and global barriers happening in rapid succession (parallel_row_col_barriers.c).
  • Alias Snitch targets for visual trace generation (remove sn- prefix).
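
To make the communicator idea in the list above more concrete, here is a minimal sketch in C of how a communicator over a 2D tile grid could be represented, and how row, column and global groups (as exercised by parallel_row_col_barriers.c) or overlapping groups (as in overlapping_barriers.c) could be described. Everything in it is an assumption made for illustration: `comm_t`, `comm_rect`, `comm_row`, `comm_col`, `comm_contains`, `comm_size` and `comm_overlap` are hypothetical names, not the API added by this PR, and the rectangular-group model is only one plausible choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical communicator: a rectangular group of clusters in a 2D mesh,
 * identified by its origin tile and its extent in x and y. */
typedef struct {
    uint32_t x0, y0; /* coordinates of the first tile in the group */
    uint32_t w, h;   /* number of tiles along x and y */
} comm_t;

/* Build a communicator covering a w-by-h rectangle starting at (x0, y0). */
static comm_t comm_rect(uint32_t x0, uint32_t y0, uint32_t w, uint32_t h) {
    assert(w > 0 && h > 0); /* an empty group is not a valid communicator */
    comm_t c = {x0, y0, w, h};
    return c;
}

/* All clusters of one row in a mesh that is mesh_w tiles wide. */
static comm_t comm_row(uint32_t row, uint32_t mesh_w) {
    return comm_rect(0, row, mesh_w, 1);
}

/* All clusters of one column in a mesh that is mesh_h tiles tall. */
static comm_t comm_col(uint32_t col, uint32_t mesh_h) {
    return comm_rect(col, 0, 1, mesh_h);
}

/* True if the cluster at (x, y) participates in the communicator. */
static bool comm_contains(const comm_t *c, uint32_t x, uint32_t y) {
    return x >= c->x0 && x < c->x0 + c->w &&
           y >= c->y0 && y < c->y0 + c->h;
}

/* Number of participating clusters. */
static uint32_t comm_size(const comm_t *c) {
    return c->w * c->h;
}

/* True if two communicators share at least one cluster, i.e. barriers
 * on them would overlap in their participants. */
static bool comm_overlap(const comm_t *a, const comm_t *b) {
    return a->x0 < b->x0 + b->w && b->x0 < a->x0 + a->w &&
           a->y0 < b->y0 + b->h && b->y0 < a->y0 + a->h;
}
```

Under these assumptions, a global barrier would use `comm_rect(0, 0, mesh_w, mesh_h)`, while a row and a column communicator always overlap in exactly one cluster, which is the kind of situation the overlapping-barrier test targets.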

TODOs

  • Wait for VCs and reduction to be merged in FlooNoC (then reset this PR to Picobello's link-exploration branch)
  • Auto-generate NoC configuration for reductions
  • Bump to proper peakrdl-rawheader release
  • Rebase this PR on Picobello's main
  • What to do with reduction_benchmark_hyperbank.c?
  • What to do with summa_gemm.c and gemm_2d.c implementations?

@Lore0599 Lore0599 force-pushed the feature/reduction branch 7 times, most recently from 69ff120 to 0796dc5 Compare March 4, 2026 15:02
@Lore0599 Lore0599 force-pushed the feature/reduction branch 3 times, most recently from d0b9142 to 2b4f86b Compare March 6, 2026 16:25
@colluca colluca force-pushed the feature/reduction branch from c0cc426 to 879d079 Compare March 20, 2026 17:23
@Lore0599 Lore0599 force-pushed the feature/reduction branch from f4eb08d to 6a5cb8d Compare March 24, 2026 11:18
@Lore0599 Lore0599 force-pushed the feature/reduction branch from 9180b8f to f476df8 Compare March 25, 2026 17:06

@github-actions github-actions Bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

verible-verilog-format

[verible-verilog-format] reported by reviewdog 🐶

picobello/hw/mem_tile.sv

Lines 79 to 80 in f476df8

.offload_wide_req_o ( ),
.offload_wide_rsp_i ( '0 ),


[verible-verilog-format] reported by reviewdog 🐶

picobello/hw/mem_tile.sv

Lines 82 to 83 in f476df8

.offload_narrow_req_o ( ),
.offload_narrow_rsp_i ('0 )


[verible-verilog-format] reported by reviewdog 🐶

assign floo_wide_o[West:North] =router_floo_wide_out[West:North];


[verible-verilog-format] reported by reviewdog 🐶

picobello/hw/mem_tile.sv

Lines 495 to 508 in f476df8

`ASSERT(NoCollectivOperation_NReq_In, (!floo_req_i[r].valid | (floo_req_i[r].req[0].generic.hdr.collective_op == Unicast)),
clk_i, !rst_ni,
$sformatf("Unsupported collective attempted with destination: %h", floo_req_i[r].req[0].narrow_aw.payload.addr))
`ASSERT(NoCollectivOperation_NRsp_In, (!floo_rsp_i[r].valid | (floo_rsp_i[r].rsp[0].generic.hdr.collective_op == Unicast)))
`ASSERT(NoCollectivOperation_NWide_In, (!floo_wide_i[r].valid | (floo_wide_i[r].wide[0].generic.hdr.collective_op == Unicast)),
clk_i, !rst_ni,
$sformatf("Unsupported collective attempted with destination: %h", floo_wide_i[r].wide[0].wide_aw.payload.addr))
`ASSERT(NoCollectivOperation_NReq_Out, (!floo_req_o[r].valid | (floo_req_o[r].req[0].generic.hdr.collective_op == Unicast)),
clk_i, !rst_ni,
$sformatf("Unsupported collective attempted with destination: %h", floo_req_o[r].req[0].narrow_aw.payload.addr))
`ASSERT(NoCollectivOperation_NRsp_Out, (!floo_rsp_o[r].valid | (floo_rsp_o[r].rsp[0].generic.hdr.collective_op == Unicast)))
`ASSERT(NoCollectivOperation_NWide_Out, (!floo_wide_o[r].valid | (floo_wide_o[r].wide[0].generic.hdr.collective_op == Unicast)),
clk_i, !rst_ni,
$sformatf("Unsupported collective attempted with destination: %h", floo_wide_i[r].wide[0].wide_aw.payload.addr))


[verible-verilog-format] reported by reviewdog 🐶

picobello/hw/spm_tile.sv

Lines 89 to 101 in f476df8

.AxiCfgN (AxiCfgN),
.AxiCfgW (AxiCfgW),
.RouteAlgo (RouteCfgNoMcast.RouteAlgo),
.NumRoutes (5),
.InFifoDepth (2),
.OutFifoDepth(2),
.id_t (id_t),
.hdr_t (hdr_t),
.floo_req_t (floo_req_t),
.floo_rsp_t (floo_rsp_t),
.floo_wide_t (floo_wide_t),
.WideRwDecouple (WideRwDecouple),
.VcImpl (VcImpl)


[verible-verilog-format] reported by reviewdog 🐶

picobello/hw/spm_tile.sv

Lines 107 to 116 in f476df8

.id_route_map_i('0),
.floo_req_i (router_floo_req_in),
.floo_rsp_o (router_floo_rsp_out),
.floo_req_o (router_floo_req_out),
.floo_rsp_i (router_floo_rsp_in),
.floo_wide_i (router_floo_wide_in),
.floo_wide_o (router_floo_wide_out),
// Wide Reduction offload port
.offload_wide_req_o ( ),
.offload_wide_rsp_i ( '0 ),


[verible-verilog-format] reported by reviewdog 🐶

picobello/hw/spm_tile.sv

Lines 118 to 119 in f476df8

.offload_narrow_req_o ( ),
.offload_narrow_rsp_i ('0 )


[verible-verilog-format] reported by reviewdog 🐶

assign floo_wide_o[West:North] =router_floo_wide_out[West:North];

@Lore0599 Lore0599 force-pushed the feature/reduction branch from f476df8 to ec0ef69 Compare March 25, 2026 17:27

@github-actions github-actions Bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

verible-verilog-format

[verible-verilog-format] reported by reviewdog 🐶

picobello/hw/spm_tile.sv

Lines 89 to 101 in ec0ef69

.AxiCfgN (AxiCfgN),
.AxiCfgW (AxiCfgW),
.RouteAlgo (RouteCfgNoMcast.RouteAlgo),
.NumRoutes (5),
.InFifoDepth (2),
.OutFifoDepth(2),
.id_t (id_t),
.hdr_t (hdr_t),
.floo_req_t (floo_req_t),
.floo_rsp_t (floo_rsp_t),
.floo_wide_t (floo_wide_t),
.WideRwDecouple (WideRwDecouple),
.VcImpl (VcImpl)


[verible-verilog-format] reported by reviewdog 🐶

picobello/hw/spm_tile.sv

Lines 107 to 116 in ec0ef69

.id_route_map_i('0),
.floo_req_i (router_floo_req_in),
.floo_rsp_o (router_floo_rsp_out),
.floo_req_o (router_floo_req_out),
.floo_rsp_i (router_floo_rsp_in),
.floo_wide_i (router_floo_wide_in),
.floo_wide_o (router_floo_wide_out),
// Wide Reduction offload port
.offload_wide_req_o ( ),
.offload_wide_rsp_i ( '0 ),

@fischeti fischeti marked this pull request as ready for review March 26, 2026 16:06
@fischeti fischeti changed the title treewide: Add MLSys collective paper developments treewide: Add NoC collective operations + MLSys experiements Mar 26, 2026
@fischeti fischeti changed the title treewide: Add NoC collective operations + MLSys experiements treewide: Add NoC collective operations + MLSys experiments Mar 26, 2026
@fischeti fischeti merged commit cacdc3a into main Mar 26, 2026
4 of 5 checks passed
@fischeti fischeti deleted the feature/reduction branch March 26, 2026 16:10

4 participants