Add json+struct codec by benjeffery · Pull Request #3306 · tskit-dev/tskit

benjeffery · 2025-10-24T15:33:41Z

Add a metadata codec that supports JSON and a binary blob.

codecov · 2025-10-24T15:39:30Z

Codecov Report

❌ Patch coverage is 90.20979% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.92%. Comparing base (0e37e51) to head (07b12ff).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3306      +/-   ##
==========================================
- Coverage   91.93%   91.92%   -0.01%     
==========================================
  Files          37       37              
  Lines       32011    32153     +142     
  Branches     5123     5143      +20     
==========================================
+ Hits        29428    29556     +128     
- Misses       2256     2264       +8     
- Partials      327      333       +6

Flag	Coverage Δ
C	`82.70% <100.00%> (+0.06%)`	⬆️
c-python	`77.34% <0.00%> (-0.20%)`	⬇️
python-tests	`96.40% <83.13%> (-0.14%)`	⬇️
python-tests-no-jit	`33.22% <14.45%> (-0.20%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
Python API	`98.69% <83.13%> (-0.17%)`	⬇️
Python C interface	`91.23% <ø> (ø)`
C library	`88.86% <100.00%> (+0.03%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

petrelharp · 2025-10-24T15:48:13Z

This is the right way.

bhaller · 2025-10-24T15:54:35Z

Hi @benjeffery! This looks like it might provide a reasonable solution, sure. You prefer it to the proposal you made before, I take it? Because at least it's just more top-level metadata, rather than an entirely new thing?

Anyhow, I guess my main question is how this will look from the C API perspective. I see you've got a special header:

_HDR = struct.Struct("<4sBQQ") # magic, version, json_len, blob_len

which you parse in the new codec to split out the JSON and binary parts. Do you intend to provide some kind of wrapper for this in C, so that clients don't need to hard-code the structure of this header, at write and at read? (Or at least make the "magic" and the codec version be public in a C header, so that those aspects of it don't need to be hard-coded in SLiM.)

@petrelharp will this work for you on the pyslim end of things? Ah, I see your comment above now. :->

bhaller · 2025-10-28T15:41:25Z

...Anyhow, I guess my main question is how this will look from the C API perspective. I see you've got a special header:

_HDR = struct.Struct("<4sBQQ") # magic, version, json_len, blob_len

which you parse in the new codec to split out the JSON and binary parts. Do you intend to provide some kind of wrapper for this in C, so that clients don't need to hard-code the structure of this header, at write and at read? (Or at least make the "magic" and the codec version be public in a C header, so that those aspects of it don't need to be hard-coded in SLiM.)

Hi @benjeffery, just a ping on this; wondering about the C API plan for this PR. Apart from that concern, this proposal looks great to me; once it's merged @petrelharp and I can update the tskit copy in SLiM to this and test it out. :-> Thanks!

benjeffery · 2025-10-30T11:43:02Z

Hi @benjeffery, just a ping on this

Sorry, school holidays here so had some leave. I'm not sure about this, it's been a long standing principle that the C code doesn't touch the contents of metadata or have anything to do with codecs. I can see a small helper that just extracts the binary buffer being helpful though. I've added that in the last commit.

bhaller · 2025-10-30T12:10:22Z

Sorry, school holidays here so had some leave.

:->

I'm not sure about this, it's been a long standing principle that the C code doesn't touch the contents of metadata or have anything to do with codecs. I can see a small helper that just extracts the binary buffer being helpful though. I've added that in the last commit.

This looks good to me, thanks. I can see the desire for the C code not to get into metadata/codec stuff, but this seems like a good exception to that principle to carve out, otherwise every client of this codec would have to reinvent this (potentially bug-prone) wheel. I think this is the right design.

As far as I can tell there is no dependence here on the JSON string being null-terminated; is that correct? Looks like after calling tsk_json_binary_metadata_get_blob() the length of the JSON can just be determined by the caller based upon the offset to, or the length of, the binary blob.

And since the header length and magic bytes and such are in core.c and thus private, it looks like the intent is for the caller to blindly call tsk_json_binary_metadata_get_blob() without knowing whether the metadata is in fact compliant, and use the return value to determine compliance and raise the appropriate user-visible error? Which seems right, just checking.

I'm happy with this, thanks!

benjeffery · 2025-10-31T11:15:03Z

As far as I can tell there is no dependence here on the JSON string being null-terminated; is that correct?

Yes, that's right as we have fixed lengths. Makes me think the helper should return pointers/lengths to both parts so that API users can grab and decode the JSON?

bhaller · 2025-10-31T11:20:39Z

As far as I can tell there is no dependence here on the JSON string being null-terminated; is that correct?

Yes, that's right as we have fixed lengths. Makes me think the helper should return pointers/lengths to both parts so that API users can grab and decode the JSON?

That would perhaps be an improvement, yes. It's easy enough for the client to do the math as it is now; but it does seem better for the helper to do it, so that the client is not building in assumptions about the layout of the metadata. Actually, now that I think about it more, the client would also have to build in an assumption about the number of bytes in the codec's header, wouldn't they? If that's true, then the helper should definitely return pointers/lengths for both parts; we don't want the client's code knowing that sort of thing.

benjeffery · 2025-11-03T10:33:46Z

Added returning of json with blob.

jeromekelleher · 2025-11-03T11:55:20Z

This generally looks good to me. One question is should we provide a "setter" function in C also if we're providing a getter?

bhaller · 2025-11-03T12:10:30Z

This generally looks good to me. One question is should we provide a "setter" function in C also if we're providing a getter?

Good point, I wasn't thinking about that end of things! It does seem like a good idea to not have the client hard-coding the structure of the header and such.

benjeffery · 2025-11-03T13:28:37Z

This generally looks good to me. One question is should we provide a "setter" function in C also if we're providing a getter?

facepalm! Of course.

benjeffery · 2025-11-03T14:14:57Z

@bhaller Talking this over with JK just now, we realised that it is a problem that the bytes in the binary blob are not easily interpretable by tskit users e.g. pyslim, or described in the schema. To fix this we propose a metadata codec that allows specification of JSON and binary in one schema. It would look like:

{
    "codec":"json+struct",
    "json": _A NORMAL JSON SCHEMA_,
    "struct": _A NORMAL STRUCT SCHEMA_,
}

It would be an error to have identical keys in the two schemas, as the returned metadata object merges the two key lists. This allows minimal changes to existing code as opposed to returning with a top-level split as in the schema.

This makes no difference at all to the C side, except that the binary should be encoded in a struct codec friendly way.

This is quite a bit more work, so I will have to return to it after tskit 1.0 is out the door, but as it needs no C changes you can go ahead and start developing assuming the current binary structure.

Hopefully that all makes sense!

bhaller · 2025-11-03T15:04:21Z

@bhaller Talking this over with JK just now, we realised that it is a problem that the bytes in the binary blob are not easily interpretable by tskit users e.g. pyslim, or described in the schema...

Hmm, I see. I'm fine with this if you guys want to get into it. I think I can see how having a schema for the binary might make life easier for pyslim, since then I guess all the binary metadata would be automatically be parsed and made accessible by tskit on the python side? If the current PR's design doesn't have a schema for the JSON metadata, then I agree that that is clearly a problem; having that schema is essential. If it is just that the current PR's design doesn't have a schema for the binary metadata, but does have one for the JSON (which I assumed was the case, but hadn't really thought about as much as I should have!), then I'm not sure how much of a shortcoming that really is in practice, but I can see the broad motivations for attaching a schema to everything – completeness, consistency, transparency, etc. @petrelharp any thoughts here on how useful this would be for pyslim?

It's unfortunate that this won't make it into tskit 1.0, just from the perspective of wanting tskit 1.0 to be kind of a completed vision with no loose ends; but of course that's an impossible dream, and this new proposal sounds like more work for sure, so no worries, push it out. :-O I will end up needing it in the next couple of months, though, for the next release of SLiM, so I hope we're not talking about pushing it too terribly far out, though; what timeframe would you estimate?

Purely out of curiosity (I have no need/desire to do this in practice at present), would this new metadata codec be usable elsewhere in tskit? I.e., could you declare that the metadata column on any given tskit table uses this new codec? Or would its use be limited to the top-level metadata? I don't really have a clear idea of how all the metadata bits and pieces fit together architecturally in tskit. :->

Thanks for taking this on! As always, things turn out to be more complicated than originally envisioned. There must be a Somebody's Law about that. I greatly appreciate your time and effort.

benjeffery · 2025-11-03T15:38:34Z

tskit 1.0 is a forward-compatibility boundary, not a statement of completeness. This work involves no breaking changes so it doesn't matter which side of that boundary it sits.

I'm hoping to have 1.0 in the order of 1-2 weeks, and this work to follow after. Certainly fits in the timescale you have above.

petrelharp · 2025-11-04T04:45:11Z

That sounds great, @benjeffery.

bhaller · 2025-11-30T22:45:59Z

Hi @benjeffery! Congrats on tskit 1.0! :-> Just a ping to say that this issue remains important for me over on the SLiM side. Thanks!

benjeffery · 2025-12-11T15:14:29Z

Ok, I've added the layer that let's python read the binary. I think this is now complete?

bhaller · 2025-12-11T15:19:19Z

Awesome, thanks! I'm not sure exactly when I will be able to actually exercise this in SLiM; things are complicated in the multitrait branch right now, lol. I figure that when the time comes I will copy this PR's version of tskit into SLiM and try to build stuff on top of it, and report back here if I encounter any problems. Is it OK for you to have this PR sitting open for some weeks until that happens?

benjeffery · 2025-12-11T16:07:58Z

Is it OK for you to have this PR sitting open for some weeks until that happens?

Yep, I'm on leave for a week now anyway. I finished this up before I left as was told it was becoming urgent for you.

bhaller · 2025-12-11T16:41:40Z

Is it OK for you to have this PR sitting open for some weeks until that happens?

Yep, I'm on leave for a week now anyway. I finished this up before I left as was told it was becoming urgent for you.

Thanks! Yes, I've been holding off on launching into some big work for several weeks until this functionality became available. Now that it has arrived, I'll plunge into that project. :-> Enjoy your leave! 🏝️

hyanwong · 2026-02-07T22:35:34Z

Is this actually ready to review and hopefully merge @benjeffery ? I ask because I think @bhaller is going to be moving to get the new SLiM version compatible with tskit, so I guess a released tskit version with this functionality would be very useful?

bhaller · 2026-02-07T22:53:47Z

Is this actually ready to review and hopefully merge @benjeffery ? I ask because I think @bhaller is going to be moving to get the new SLiM version compatible with tskit, so I guess a released tskit version with this functionality would be very useful?

Yes! Sorry, the previous work on my big SLiM project took quite a while, so I didn't get to trying this out nearly as soon as I thought I would. But I am now optimistic that I am ready to start building on top of this feature in SLiM more or less now. :-> I thought this had been merged and released a while ago, actually! I'm not really sure how to build my own tskit from a feature branch like this, being rather clueless about both GitHub and Python, so yes, it'd be great if this came out in a release. Thanks for noticing this, @hyanwong!

bhaller · 2026-03-02T12:51:19Z

It would be helpful to get this finalized, merged, and released.

Sorry will do this this week.

Hi @benjeffery! Just a ping – I guess life is busy lately! :->

bhaller · 2026-03-09T14:31:03Z

Hi @jeromekelleher. I see you're working on this PR. Are my various comments/suggestions above under consideration?

- Add specific errors for JSON blob decode - Update changelog and API docs - Switch to char* for blobs for consistency - Update Python changelog Fixup lint

jeromekelleher · 2026-03-09T14:49:22Z

I've had a look through and I think these were the active points @bhaller:

Can I suggest that the test should be != rather than <? Seems like it ought to match exactly if the metadata is well-formed; and I had a subtle bug just now that had me tearing my hair out that would've been caught with a != test here. Is there a reason that you want to allow a mismatch here?

I think there's a good reason to be lenient here in allowing the length of the input buffer to be greater than the sum of the two segments - you might want to add padding for alignment. So, I think the current version is good.

can const uint8_t **blob be uint8_t **blob without the const

These have been changes to const char **blob, for consistency. The const is necessary here for us to compile and signals that the function doesn't alter the buffers.

bhaller · 2026-03-09T15:36:26Z

I've had a look through and I think these were the active points @bhaller:

Can I suggest that the test should be != rather than <? Seems like it ought to match exactly if the metadata is well-formed; and I had a subtle bug just now that had me tearing my hair out that would've been caught with a != test here. Is there a reason that you want to allow a mismatch here?

I think there's a good reason to be lenient here in allowing the length of the input buffer to be greater than the sum of the two segments - you might want to add padding for alignment. So, I think the current version is good.

Hmm – this doesn't seem useful for alignment, since the length difference goes at the end...? For alignment, you want to have extra bytes inserted in between the JSON and the binary, which is not presently supported. (I suggested adding alignment bytes above, in fact; that would be quite a useful thing to provide in this PR!) As far as I can see, all that < does for the client is introduce the possibility of bugs. (I'm handling alignment myself in SLiM now, in fact, and the use of < here was in no way helpful for doing that; all it did was hide a bug from me.)

I hope that the addition of support for alignment can be added as I suggested. It was fairly fiddly doing it right in SLiM, and Peter is just going to have to duplicate that work in pyslim (to be compatible with SLiM)), as will everybody else who cares about alignment.

can const uint8_t **blob be uint8_t **blob without the const

These have been changes to const char **blob, for consistency. The const is necessary here for us to compile and signals that the function doesn't alter the buffers.

It's a bit annoying to require a caller to pass a pointer to const just because the function itself doesn't alter the buffers. If it was just a question of a const char * parameter and the user passing in a char * to the function, that would be fine; the compiler would automatically accept the char * for the parameter, rather than requiring the caller to provide a const char *. But because of the extra level of indirection, a const char ** parameter does not automatically accept a char ** parameter, requiring the caller to do some faffing about with casts and such. I would argue that it is not standard practice to put the const there – when passing a pointer to a pointer, it is already assumed that the function is not supposed to muck around inside the thing that is ultimately pointed to, and that guarantee is not typically made in APIs (and in many cases cannot be made, in C/C++ syntax, such as guaranteeing that a function won't modify stuff that is pointed to by pointers contained within a const structure that it is passed). So I don't think you ought to have that const there. But it's not the end of the world. (I'd question the decision to use char instead of uint8_t, though; they are not the same thing, and I think the semantics of uint8_t are more appropriate for binary data, at least.)

The other point I made above that seems to have been missed (?) is the need for some aspects of this design, such as the magic byte sequence, to be made public in the tskit headers so that SLiM doesn't have to make its own private copy of that information (introducing the possibility of various bugs). Information that is needed for a client to effectively use a public API should be part of the public API.

bhaller · 2026-03-09T15:37:56Z

I'll have a read through to make sure nothing else above got missed, but right now I need to go...

bhaller · 2026-03-09T15:38:28Z

Auto-merge should not be enabled on this PR; it is still under review/discussion.

jeromekelleher · 2026-03-09T15:46:45Z

This discussion has got very long and diffuse @bhaller - let's merge this much and raise follow on issues separately. If you'd like to propose changes to the current API please go ahead and open an issue.

bhaller · 2026-03-09T19:51:25Z

This discussion has got very long and diffuse @bhaller - let's merge this much and raise follow on issues separately. If you'd like to propose changes to the current API please go ahead and open an issue.

OK, I have opened four new issues for the things I think ought to be fixed. I hope all this can be resolved promptly; pyslim and SLiM need this design to be finalized and released to move forward. Thanks!

benjeffery force-pushed the json-and-blob branch from bae6b68 to 6c62d45 Compare October 24, 2025 15:34

benjeffery mentioned this pull request Oct 24, 2025

SLiM needs a place to put mutation metadata MesserLab/SLiM#569

Open

benjeffery force-pushed the json-and-blob branch from 6c62d45 to 6f871a5 Compare October 24, 2025 19:09

benjeffery force-pushed the json-and-blob branch from 8625ace to 47d86c9 Compare October 30, 2025 11:55

bhaller mentioned this pull request Nov 30, 2025

update to tskit 1.0.0 MesserLab/SLiM#565

Merged

benjeffery force-pushed the json-and-blob branch from 81ce9f8 to 06e279a Compare December 11, 2025 15:14

hyanwong changed the title ~~Add json+binary codec~~ Add json+struct codec Dec 11, 2025

benjeffery added 4 commits March 9, 2026 10:48

Add json+binary codec

c1f4e45

Add C helper

02fd59c

Return both json and blob

87aea8d

Add struct codec to the binary side

fa6b081

jeromekelleher force-pushed the json-and-blob branch 3 times, most recently from 0ec3067 to b212866 Compare March 9, 2026 14:25

jeromekelleher approved these changes Mar 9, 2026

View reviewed changes

Various review updates

07b12ff

- Add specific errors for JSON blob decode - Update changelog and API docs - Switch to char* for blobs for consistency - Update Python changelog Fixup lint

jeromekelleher force-pushed the json-and-blob branch from b212866 to 07b12ff Compare March 9, 2026 14:34

jeromekelleher enabled auto-merge March 9, 2026 14:38

jeromekelleher disabled auto-merge March 9, 2026 14:40

jeromekelleher enabled auto-merge March 9, 2026 14:49

jeromekelleher added this pull request to the merge queue Mar 9, 2026

jeromekelleher removed this pull request from the merge queue due to a manual request Mar 9, 2026

jeromekelleher merged commit 1835ea3 into tskit-dev:main Mar 9, 2026
15 of 17 checks passed

benjeffery deleted the json-and-blob branch March 13, 2026 00:42

petrelharp mentioned this pull request Mar 13, 2026

Json struct alignment #3437

Open

3 tasks

Conversation

benjeffery commented Oct 24, 2025

Uh oh!

codecov bot commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

petrelharp commented Oct 24, 2025

Uh oh!

bhaller commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bhaller commented Oct 28, 2025

Uh oh!

benjeffery commented Oct 30, 2025

Uh oh!

bhaller commented Oct 30, 2025

Uh oh!

benjeffery commented Oct 31, 2025

Uh oh!

bhaller commented Oct 31, 2025

Uh oh!

benjeffery commented Nov 3, 2025

Uh oh!

jeromekelleher commented Nov 3, 2025

Uh oh!

bhaller commented Nov 3, 2025

Uh oh!

benjeffery commented Nov 3, 2025

Uh oh!

benjeffery commented Nov 3, 2025

Uh oh!

bhaller commented Nov 3, 2025

Uh oh!

benjeffery commented Nov 3, 2025

Uh oh!

petrelharp commented Nov 4, 2025

Uh oh!

bhaller commented Nov 30, 2025

Uh oh!

benjeffery commented Dec 11, 2025

Uh oh!

bhaller commented Dec 11, 2025

Uh oh!

benjeffery commented Dec 11, 2025

Uh oh!

bhaller commented Dec 11, 2025

Uh oh!

hyanwong commented Feb 7, 2026

Uh oh!

bhaller commented Feb 7, 2026

Uh oh!

bhaller commented Mar 2, 2026

Uh oh!

bhaller commented Mar 9, 2026

Uh oh!

jeromekelleher commented Mar 9, 2026

Uh oh!

bhaller commented Mar 9, 2026

Uh oh!

bhaller commented Mar 9, 2026

Uh oh!

bhaller commented Mar 9, 2026

Uh oh!

jeromekelleher commented Mar 9, 2026

Uh oh!

Uh oh!

Uh oh!

bhaller commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

codecov bot commented Oct 24, 2025 •

edited

Loading

bhaller commented Oct 24, 2025 •

edited

Loading