feat: add StringSequenceToEmbedding transformer and layer #47

Open
mruiyangyou wants to merge 4 commits into ExpediaGroup:main from mruiyangyou:add-string-sequence-to-embedding

Conversation

@mruiyangyou

Summary

Adds a new StringSequenceToEmbedding transformer/layer pair that parses a delimited string of pre-computed embedding vectors into a dense (seq_len, embedding_dim) float tensor.

  • Spark transformer (StringSequenceToEmbeddingTransformer) and Keras layer (StringSequenceToEmbeddingLayer) with parity behaviour.
  • Input format: vectors separated by sequence_separator (default ,), floats within a vector by separator (default |). Example: "1|2|3,4|5|6" with seq_len=2, embedding_dim=3 → [[1,2,3],[4,5,6]].
  • Pads short sequences with pad_value (default "0"); truncates long ones.
  • Optional reverse mode reverses only the non-pad portion of each sequence (useful for chronological → recency-first ordering).
  • Layer drops a trailing size-1 input axis to match the StringToStringListLayer convention, so (None, 1, 1) inputs produce (None, 1, seq_len, embedding_dim) outputs without a downstream squeeze.
  • Empty tokens (from leading/trailing/repeated separators or fully empty cells) are replaced with pad_value before tf.strings.to_number to avoid StringToNumberOp failures at graph execution.
  • README updated with a row in the supported layers table.
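
The parsing rules listed above can be sketched in plain Python as a reference for the expected values. This is not the PR's implementation (which uses Spark and TensorFlow ops); the function name `parse_embedding_sequence` is hypothetical, and three points are assumptions not stated in the summary: truncation is applied before reversal, short vectors are right-padded to embedding_dim, and a vector produced from an empty chunk counts as real data for reversal.

```python
from typing import List


def parse_embedding_sequence(
    value: str,
    seq_len: int,
    embedding_dim: int,
    separator: str = "|",
    sequence_separator: str = ",",
    pad_value: str = "0",
    reverse: bool = False,
) -> List[List[float]]:
    """Plain-Python reference for the behaviour described in the summary."""
    vectors = []
    for chunk in value.split(sequence_separator):
        # Empty tokens are swapped for pad_value before the numeric cast.
        tokens = [tok if tok != "" else pad_value for tok in chunk.split(separator)]
        floats = [float(tok) for tok in tokens][:embedding_dim]
        # Assumption: short vectors are right-padded to embedding_dim.
        floats += [float(pad_value)] * (embedding_dim - len(floats))
        vectors.append(floats)

    # Assumption: truncation happens before reversal.
    vectors = vectors[:seq_len]
    if reverse:
        # Only the parsed vectors are reordered; padding is appended afterwards.
        vectors.reverse()
    pad_vector = [float(pad_value)] * embedding_dim
    vectors += [list(pad_vector) for _ in range(seq_len - len(vectors))]
    return vectors


print(parse_embedding_sequence("1|2|3,4|5|6", seq_len=2, embedding_dim=3))
print(parse_embedding_sequence("1|2,3|4", seq_len=3, embedding_dim=2, reverse=True))
```

With reverse=True the padding stays at the end of the sequence, so a chronological input becomes recency-first without moving pad vectors to the front.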

Test plan

  • tests/kamae/tensorflow/layers/test_string_sequence_to_embedding.py — 9 unit tests covering default/custom separators, padding, truncation, reverse, trailing-1 squeeze behaviour, empty/malformed inputs, config round-trip, and invalid args.
  • tests/kamae/spark/transformers/test_string_sequence_to_embedding.py — 6 tests including Spark/TF parity across separators, padding, and reverse modes.
  • tests/kamae/tensorflow/test_layer_serialisation.py — added the new layer to the serialisation matrix; passes test_all_layers_tested_for_serialisation.
  • No regressions in existing string_to_string_list tests.

🤖 Generated with Claude Code

ruiyyou and others added 4 commits May 13, 2026 12:48
Parses a delimited string of pre-computed embedding vectors into a
(seq_len, embedding_dim) float tensor, with optional reversal of the
non-pad portion of each sequence. Includes Spark transformer, Keras
layer, unit tests, Spark/TF parity tests, and serialisation test entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drops a trailing size-1 dimension from the input shape before appending
(seq_len, embedding_dim), matching the StringToStringListLayer
convention. This lets inputs of shape (None, 1, 1) produce outputs of
shape (None, 1, seq_len, embedding_dim) without a downstream squeeze.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
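
The shape rule in this commit message can be written as a small pure-Python function. This is a sketch only: `output_shape` is a hypothetical helper, not the layer's own method, and the guard that keeps the batch axis on rank-1 inputs is an assumption.

```python
from typing import Optional, Tuple

Shape = Tuple[Optional[int], ...]


def output_shape(input_shape: Shape, seq_len: int, embedding_dim: int) -> Shape:
    """Drop a trailing size-1 axis, then append (seq_len, embedding_dim)."""
    shape = tuple(input_shape)
    # Assumption: the batch axis is never dropped, so rank-1 inputs are left alone.
    if len(shape) > 1 and shape[-1] == 1:
        shape = shape[:-1]
    return shape + (seq_len, embedding_dim)


print(output_shape((None, 1, 1), seq_len=4, embedding_dim=8))  # (None, 1, 4, 8)
print(output_shape((None, 1), seq_len=4, embedding_dim=8))     # (None, 4, 8)
```
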
tf.strings.split produces empty-string tokens when inputs contain
leading, trailing, or repeated separators (or are entirely empty),
which then caused tf.strings.to_number to fail at graph execution
with "StringToNumberOp could not correctly convert string: ".

Replace empty tokens with pad_value before the numeric cast so the
layer matches the Spark transformer's behaviour. Adds regression tests
covering empty cells, leading, trailing, and repeated separators.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
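
The failure mode in this commit has a close plain-Python analogue, shown here as an illustration rather than the TensorFlow code from the PR: splitting on an explicit separator yields empty strings, and casting an empty string to a float fails. The helper name `replace_empty_tokens` is hypothetical.

```python
from typing import List


def replace_empty_tokens(tokens: List[str], pad_value: str = "0") -> List[str]:
    """Swap empty-string tokens for pad_value so the numeric cast cannot fail."""
    return [tok if tok != "" else pad_value for tok in tokens]


# Leading, trailing, and repeated separators, plus a fully empty cell.
for raw in ["|1|2", "1|2|", "1||2", ""]:
    tokens = raw.split("|")
    try:
        [float(tok) for tok in tokens]
    except ValueError:
        # Every input above contains at least one empty token, so the raw cast fails.
        pass
    cleaned = replace_empty_tokens(tokens)
    print(repr(raw), tokens, [float(tok) for tok in cleaned])
```
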
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mruiyangyou mruiyangyou requested a review from a team as a code owner May 18, 2026 09:58
@georyetti
Contributor

Hey @mruiyangyou - this looks good at a glance, but we are gearing up for a big kamae v3 release in the next few weeks. Can this wait until after that? I can help migrate this to Keras 3.

2 participants