feat: add StringSequenceToEmbedding transformer and layer by mruiyangyou · Pull Request #47 · ExpediaGroup/kamae

mruiyangyou · 2026-05-18T09:58:10Z

Summary

Adds a new StringSequenceToEmbedding transformer/layer pair that parses a delimited string of pre-computed embedding vectors into a dense (seq_len, embedding_dim) float tensor.

Spark transformer (StringSequenceToEmbeddingTransformer) and Keras layer (StringSequenceToEmbeddingLayer) with parity behaviour.
Input format: vectors are separated by sequence_separator (default ","), and floats within a vector by separator (default "|"). Example: "1|2|3,4|5|6" with seq_len=2, embedding_dim=3 → [[1,2,3],[4,5,6]].
Pads short sequences with pad_value (default "0"); truncates long ones.
Optional reverse mode reverses only the non-pad portion of each sequence (useful for chronological → recency-first ordering).
Layer drops a trailing size-1 input axis to match the StringToStringListLayer convention, so (None, 1, 1) inputs produce (None, 1, seq_len, embedding_dim) outputs without a downstream squeeze.
Empty tokens (from leading/trailing/repeated separators or fully empty cells) are replaced with pad_value before tf.strings.to_number to avoid StringToNumberOp failures at graph execution.
README updated with a row in the supported layers table.
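For readers who want to check the intended semantics without Spark or TensorFlow, the behaviour listed above can be sketched in plain Python. This is an illustrative reference only, not the PR's implementation: the function name parse_embedding_sequence is hypothetical, and several details are assumptions the summary does not spell out (truncation keeps the leading vectors, padding is appended at the end, reversal happens after truncation, and each vector is itself padded or truncated to embedding_dim).

```python
from typing import List


def parse_embedding_sequence(
    value: str,
    seq_len: int,
    embedding_dim: int,
    separator: str = "|",
    sequence_separator: str = ",",
    pad_value: str = "0",
    reverse: bool = False,
) -> List[List[float]]:
    """Plain-Python sketch of the parse/pad/truncate/reverse behaviour."""
    # An entirely empty cell contributes no vectors (assumption).
    vectors = value.split(sequence_separator) if value else []
    # Assumption: truncation keeps the first seq_len vectors.
    vectors = vectors[:seq_len]

    parsed: List[List[float]] = []
    for vec in vectors:
        tokens = vec.split(separator)[:embedding_dim]
        # Empty tokens are replaced with pad_value before the numeric cast.
        tokens = [tok if tok != "" else pad_value for tok in tokens]
        # Assumption: short vectors are right-padded to embedding_dim.
        tokens += [pad_value] * (embedding_dim - len(tokens))
        parsed.append([float(tok) for tok in tokens])

    # Reverse only the parsed (non-pad) rows, then append padding rows.
    if reverse:
        parsed.reverse()
    pad_row = [float(pad_value)] * embedding_dim
    parsed += [list(pad_row) for _ in range(seq_len - len(parsed))]
    return parsed
```

Under these assumptions, parse_embedding_sequence("1|2|3,4|5|6", seq_len=2, embedding_dim=3) reproduces the example from the summary, and with reverse=True a two-vector input padded to seq_len=3 keeps its padding row last.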

Test plan

tests/kamae/tensorflow/layers/test_string_sequence_to_embedding.py — 9 unit tests covering default/custom separators, padding, truncation, reverse, trailing-1 squeeze behaviour, empty/malformed inputs, config round-trip, and invalid args.
tests/kamae/spark/transformers/test_string_sequence_to_embedding.py — 6 tests including Spark/TF parity across separators, padding, and reverse modes.
tests/kamae/tensorflow/test_layer_serialisation.py — added the new layer to the serialisation matrix; passes test_all_layers_tested_for_serialisation.
No regressions in existing string_to_string_list tests.

🤖 Generated with Claude Code

Parses a delimited string of pre-computed embedding vectors into a (seq_len, embedding_dim) float tensor, with optional reversal of the non-pad portion of each sequence. Includes Spark transformer, Keras layer, unit tests, Spark/TF parity tests, and serialisation test entry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Drops a trailing size-1 dimension from the input shape before appending (seq_len, embedding_dim), matching the StringToStringListLayer convention. This lets inputs of shape (None, 1, 1) produce outputs of shape (None, 1, seq_len, embedding_dim) without a downstream squeeze. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
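The shape rule in this commit message can be illustrated with a small helper. This is a sketch, not the layer's code: the name output_shape is hypothetical, and the handling of rank-1 inputs and of inputs whose last axis is not 1 is an assumption; the commit message only confirms the (None, 1, 1) case.

```python
from typing import Optional, Tuple

Shape = Tuple[Optional[int], ...]


def output_shape(input_shape: Shape, seq_len: int, embedding_dim: int) -> Shape:
    """Sketch of the described rule: drop a trailing size-1 axis, then append."""
    # Assumption: the batch axis is never dropped, so rank-1 shapes are left alone.
    if len(input_shape) > 1 and input_shape[-1] == 1:
        input_shape = input_shape[:-1]
    return input_shape + (seq_len, embedding_dim)


# The case stated in the commit message:
print(output_shape((None, 1, 1), seq_len=4, embedding_dim=3))
```

With seq_len=4 and embedding_dim=3, the (None, 1, 1) input maps to (None, 1, 4, 3), which is why no downstream squeeze is needed.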

tf.strings.split produces empty-string tokens when inputs contain leading, trailing, or repeated separators (or are entirely empty), which then caused tf.strings.to_number to fail at graph execution with "StringToNumberOp could not correctly convert string: ". Replace empty tokens with pad_value before the numeric cast so the layer matches the Spark transformer's behaviour. Adds regression tests covering empty cells, leading, trailing, and repeated separators. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
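The failure mode and the fix can be reproduced by analogy in plain Python, where str.split with an explicit separator also yields empty strings for leading, trailing, or repeated separators, and float("") raises ValueError. This is an analogy only: the helper names below are hypothetical, and the layer itself uses tf.strings.split and tf.strings.to_number rather than these calls.

```python
from typing import List


def fill_empty_tokens(tokens: List[str], pad_value: str = "0") -> List[str]:
    """Replace empty-string tokens with pad_value so the numeric cast cannot fail."""
    return [tok if tok != "" else pad_value for tok in tokens]


def to_floats(raw: str, separator: str = "|", pad_value: str = "0") -> List[float]:
    """Split, fill empty tokens, then cast to float."""
    tokens = raw.split(separator)
    return [float(tok) for tok in fill_empty_tokens(tokens, pad_value)]


def cast_fails_without_fill(raw: str, separator: str = "|") -> bool:
    """Return True if casting the raw split tokens raises, as the unfixed path would."""
    try:
        [float(tok) for tok in raw.split(separator)]
    except ValueError:
        return True
    return False
```

In this sketch, inputs such as "|1|2", "1|2|", "1||2", and "" all fail the direct cast but convert cleanly once empty tokens are filled, mirroring the four regression cases the commit mentions.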

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

georyetti · 2026-05-19T09:23:29Z

Hey @mruiyangyou - this looks good at a glance, but we are gearing up for a big new kamae v3 release in the next few weeks or so. Can this wait until after that? I can help migrate this to Keras 3.

ruiyyou and others added 4 commits May 13, 2026 12:48

docs: add StringSequenceToEmbedding row to supported layers table

7bc4ed9

mruiyangyou requested a review from a team as a code owner May 18, 2026 09:58

mruiyangyou requested review from georyetti and jacobjwood May 18, 2026 09:58

mruiyangyou wants to merge 4 commits into ExpediaGroup:main from mruiyangyou:add-string-sequence-to-embedding
