
feat: add init cache type opts #455

Draft
polvalente wants to merge 2 commits into main from pv-fix/gqa-cache-and-inference-opts

Conversation

@polvalente
Contributor

This PR adds cache type and num_heads options to init_cache, for better flexibility in text generation models.
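
A rough usage sketch based on the code quoted further down (the call shape comes from the diff; the {:bf, 16} value is only an illustration):

# Allocate the key/value cache in bf16 via the new :cache_type option
# (call shape as in the quoted diff below; the target type is illustrative).
cache = init_cache(spec, batch_size, max_length, inputs, cache_type: {:bf, 16})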

polvalente self-assigned this on May 5, 2026
Comment on lines +378 to +379
output_hidden_states: false,
output_attentions: false,
Member

Do we need these new options? We prune these by default, and in order to actually return them in the model output, the user needs to opt in by configuring global layer options.

Contributor Author

I think we can drop these. I must've missed them in my self-review; my focus was on the cache typing.

Comment on lines +444 to +445
if is_nil(cross_hidden_state) do
{Layers.Decoder.get_self_attention_cache(block_cache), %Axon.None{}}
Member

Why do we need a separate function? If cross attention is not enabled, then get_attention_caches already returns none in the second element (or rather a model that compiles to none).

Contributor Author

@doc """
Retrieves self-attention and cross-attention caches from a block
cache.
"""
def get_attention_caches(block_cache) do
  {Axon.nx(block_cache, & &1.self_attention), Axon.nx(block_cache, & &1.cross_attention)}
end

It always returns cross-attention. Do you mean that cross attention is always Axon.None?

Also, being eager means fewer Axon.nx calls, which reduces the overall graph in Nx.Defn.Evaluator.
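
For context, a minimal sketch of the eager branch being discussed, pieced together from the quoted diff (names as they appear there; attention_caches is just a placeholder name for the new helper):

defp attention_caches(block_cache, cross_hidden_state) do
  if is_nil(cross_hidden_state) do
    # Eager branch: fetch only the self-attention cache and return a literal
    # %Axon.None{}, instead of building an Axon.nx node for cross attention.
    {Layers.Decoder.get_self_attention_cache(block_cache), %Axon.None{}}
  else
    # Otherwise fall back to the existing helper, which builds Axon.nx nodes
    # for both the self-attention and cross-attention caches.
    Layers.Decoder.get_attention_caches(block_cache)
  end
end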

# produced by projection layers running in compute precision, so this
# matches what the model will actually return for the cache.
cache_type = output_policy.compute || {:f, 32}
cache = init_cache(spec, batch_size, max_length, inputs, cache_type: cache_type)
Member

Do we get anything from passing it downstream instead of casting as above?

Contributor Author

polvalente · May 5, 2026

It means we can fit in a smaller memory footprint: we allocate bf16 directly instead of allocating f32 and then downcasting to bf16.
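
For a sense of scale, a back-of-the-envelope sketch (the shapes below are made up, not taken from this PR):

# Illustrative key/value cache footprint:
# batch * max_length * num_heads * head_dim elements.
elems = 1 * 4096 * 32 * 128
IO.puts("f32 cache:  #{elems * 4} bytes")  # 4 bytes per f32 element
IO.puts("bf16 cache: #{elems * 2} bytes")  # 2 bytes per bf16 element, half the footprint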

# Use the compute precision as the cache type. The key/value tensors are
# produced by projection layers running in compute precision, so this
# matches what the model will actually return for the cache.
cache_type = output_policy.compute || {:f, 32}
Member

IIRC the cache value returned from attention layers is cast using :output precision (since it's the layer output). That's why we cast it as output here.

I'm not really sure how to model this with the mixed-precision policy. It may be that we don't want to cast the cache at any point, but then we don't have the granularity to specify that, since it's a specific input/output.

Contributor Author

I went with using the policy, but originally I wanted to introduce a new explicit parameter for the cache type. When I just used the output policy, at least in my use case, the cache ended up in f32 instead of bf16 like I wanted.
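
For reference, the two options being weighed, written against the code quoted above (the explicit {:bf, 16} call is the hypothetical alternative, not what the PR does):

# What the PR does: derive the cache type from the policy's compute type,
# falling back to f32 when no policy is configured.
cache_type = output_policy.compute || {:f, 32}
cache = init_cache(spec, batch_size, max_length, inputs, cache_type: cache_type)

# The alternative mentioned above: let the caller pin the cache type
# explicitly, independent of the mixed-precision policy.
cache = init_cache(spec, batch_size, max_length, inputs, cache_type: {:bf, 16})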
