feat: add StringSequenceToEmbedding transformer and layer #47
Open
mruiyangyou wants to merge 4 commits into
Conversation
Parses a delimited string of pre-computed embedding vectors into a (seq_len, embedding_dim) float tensor, with optional reversal of the non-pad portion of each sequence. Includes Spark transformer, Keras layer, unit tests, Spark/TF parity tests, and serialisation test entry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drops a trailing size-1 dimension from the input shape before appending (seq_len, embedding_dim), matching the StringToStringListLayer convention. This lets inputs of shape (None, 1, 1) produce outputs of shape (None, 1, seq_len, embedding_dim) without a downstream squeeze. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tf.strings.split produces empty-string tokens when inputs contain leading, trailing, or repeated separators (or are entirely empty), which then caused tf.strings.to_number to fail at graph execution with "StringToNumberOp could not correctly convert string: ". Replace empty tokens with pad_value before the numeric cast so the layer matches the Spark transformer's behaviour. Adds regression tests covering empty cells, leading, trailing, and repeated separators. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
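The trailing-dimension rule from the second commit message can be sketched in plain Python. This is an illustration only, not the layer's code: `output_shape` is a hypothetical helper, and the guard that keeps the batch dimension intact is an assumption not confirmed by the PR.

```python
from typing import Optional, Tuple

Shape = Tuple[Optional[int], ...]


def output_shape(input_shape: Shape, seq_len: int, embedding_dim: int) -> Shape:
    """Hypothetical helper mirroring the shape rule described in the commit."""
    base = input_shape
    # Drop a trailing size-1 dimension. Assumption: the leading (batch)
    # dimension is never dropped, even when it happens to be 1.
    if len(base) > 1 and base[-1] == 1:
        base = base[:-1]
    # Append the parsed sequence and embedding dimensions.
    return base + (seq_len, embedding_dim)


# Case stated in the commit message: (None, 1, 1) -> (None, 1, seq_len, embedding_dim)
print(output_shape((None, 1, 1), 4, 8))  # (None, 1, 4, 8)
```

Inputs whose last dimension is not 1 keep every dimension, so under this sketch `(None, 5)` would become `(None, 5, seq_len, embedding_dim)`.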
Contributor
Hey @mruiyangyou - this looks good at a glance, but we are gearing up for a big new kamae v3 release in the next few weeks or so. Can this wait until after that? I can help migrate this to Keras 3.
Summary
Adds a new `StringSequenceToEmbedding` transformer/layer pair that parses a delimited string of pre-computed embedding vectors into a dense `(seq_len, embedding_dim)` float tensor.

- Spark transformer (`StringSequenceToEmbeddingTransformer`) and Keras layer (`StringSequenceToEmbeddingLayer`) with parity behaviour.
- Vectors are separated by `sequence_separator` (default `,`), floats within a vector by `separator` (default `|`). Example: `"1|2|3,4|5|6"` with `seq_len=2, embedding_dim=3` → `[[1,2,3],[4,5,6]]`.
- Pads short sequences with `pad_value` (default `"0"`); truncates long ones.
- `reverse` mode reverses only the non-pad portion of each sequence (useful for chronological → recency-first ordering).
- Drops a trailing size-1 input dimension, matching the `StringToStringListLayer` convention, so `(None, 1, 1)` inputs produce `(None, 1, seq_len, embedding_dim)` outputs without a downstream squeeze.
- Replaces empty tokens with `pad_value` before `tf.strings.to_number` to avoid `StringToNumberOp` failures at graph execution.

Test plan

- `tests/kamae/tensorflow/layers/test_string_sequence_to_embedding.py`: 9 unit tests covering default/custom separators, padding, truncation, reverse, trailing-1 squeeze behaviour, empty/malformed inputs, config round-trip, and invalid args.
- `tests/kamae/spark/transformers/test_string_sequence_to_embedding.py`: 6 tests including Spark/TF parity across separators, padding, and reverse modes.
- `tests/kamae/tensorflow/test_layer_serialisation.py`: added the new layer to the serialisation matrix; passes `test_all_layers_tested_for_serialisation`.
- `string_to_string_list` tests.

🤖 Generated with Claude Code
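For reviewers, the parsing semantics described in the summary can be approximated with a short pure-Python reference. This is a sketch under stated assumptions, not the Spark or Keras implementation: `parse_embedding_sequence` is a hypothetical name, padding is assumed to go on the right, truncation is assumed to keep the leading vectors and is applied before reversal, and an empty vector between repeated separators is assumed to become an all-pad row that counts as part of the non-pad portion.

```python
from typing import List


def parse_embedding_sequence(
    value: str,
    seq_len: int,
    embedding_dim: int,
    separator: str = "|",
    sequence_separator: str = ",",
    pad_value: str = "0",
    reverse: bool = False,
) -> List[List[float]]:
    """Hypothetical reference parser; see the assumptions in the lead-in."""
    pad = float(pad_value)
    vectors: List[List[float]] = []
    # Split into vectors and truncate to seq_len (assumed: keep the first ones).
    for chunk in value.split(sequence_separator)[:seq_len]:
        tokens = chunk.split(separator)[:embedding_dim]
        # Empty tokens (leading, trailing, or repeated separators, or an
        # empty cell) are replaced with the pad value before the float cast.
        row = [float(tok) if tok else pad for tok in tokens]
        row += [pad] * (embedding_dim - len(row))
        vectors.append(row)
    if reverse:
        # Reverse only the vectors that came from the input, before padding.
        vectors.reverse()
    # Right-pad the sequence with all-pad vectors (assumed padding side).
    vectors += [[pad] * embedding_dim for _ in range(seq_len - len(vectors))]
    return vectors


# Example from the summary.
print(parse_embedding_sequence("1|2|3,4|5|6", seq_len=2, embedding_dim=3))
# [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

A reference like this could serve as an oracle in the Spark/TF parity tests, provided its assumptions are first checked against the actual transformer behaviour.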