Add dbtool fold: emit self-contained baseline from registered migrations #481

Open
christianparpart wants to merge 1 commit into master from feature/dbtool-fold

@christianparpart

Adds a new offline dbtool fold subcommand that walks all registered migrations and emits a single self-contained baseline — either a .cpp migration plugin or a .sql script — that reproduces the post-migration schema and schema_migrations rows from an empty database. Useful for collapsing a long migration history into a fast-to-apply starting point, or for shipping a snapshot baseline alongside a release.

The command is purely offline: it never opens a DB connection, never queries a live schema. It loads plugins, walks each migration's Up() plan in timestamp order, folds the cumulative effect into a per-table view + chronological data steps, and emits via the existing ToSql() formatter path so each dialect's CREATE TABLE / INSERT codegen stays the single source of truth.
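The plan-walk described above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the `Step`, `Migration`, and `FoldResult` types and the `Fold` function are hypothetical stand-ins, not the types behind `FoldRegisteredMigrations`, and the sketch tracks only column names per table.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical plan steps; the real plan element types are not shown in this PR.
struct CreateTable { std::string table; std::vector<std::string> columns; };
struct AddColumn   { std::string table; std::string column; };
struct DropTable   { std::string table; };
using Step = std::variant<CreateTable, AddColumn, DropTable>;

struct Migration { std::uint64_t timestamp; std::vector<Step> steps; };

struct FoldResult
{
    std::map<std::string, std::vector<std::string>> tables; // final columns per table
    std::vector<std::string> creationOrder;                 // surviving tables, in creation order
};

// Walks the migrations in timestamp order and folds their cumulative effect
// into a per-table view. No database connection is involved at any point.
FoldResult Fold(std::vector<Migration> migrations, std::uint64_t upToInclusive)
{
    std::sort(migrations.begin(), migrations.end(),
              [](Migration const& a, Migration const& b) { return a.timestamp < b.timestamp; });

    FoldResult result;
    for (auto const& migration: migrations)
    {
        if (migration.timestamp > upToInclusive)
            break; // everything after the cut-off is ignored
        for (auto const& step: migration.steps)
        {
            std::visit([&result](auto const& s) {
                using T = std::decay_t<decltype(s)>;
                if constexpr (std::is_same_v<T, CreateTable>)
                {
                    result.tables[s.table] = s.columns;
                    result.creationOrder.push_back(s.table);
                }
                else if constexpr (std::is_same_v<T, AddColumn>)
                {
                    // Columns added to unknown tables are skipped in this sketch.
                    if (auto it = result.tables.find(s.table); it != result.tables.end())
                        it->second.push_back(s.column);
                }
                else
                {
                    // DropTable: forget the table and its place in the creation order.
                    result.tables.erase(s.table);
                    auto& order = result.creationOrder;
                    order.erase(std::remove(order.begin(), order.end(), s.table), order.end());
                }
            }, step);
        }
    }
    return result;
}
```

A table that is created and later dropped inside the folded range leaves no trace in the result, which is the property that lets a baseline skip intermediate states.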

Changes

  • New MigrationManager::FoldRegisteredMigrations(formatter, upToInclusive) primitive — pure plan-walk, returns PlanFoldingResult (per-table state, creation order, indexes, chronological data steps, in-range releases). Used by the new module and available to any future caller.
  • New Lightweight::MigrationFold library module under src/Lightweight/MigrationFold/:
    • Folder — thin facade plus ResolveUpTo() which accepts an empty string (latest registered release), a numeric timestamp, or a release version string.
    • SqlEmitter — emits a flat dialect-specific .sql script, including a CREATE TABLE schema_migrations and a stamping INSERT per folded timestamp so a freshly-loaded DB looks identical to a real apply-all run.
    • CppEmitter — emits a .cpp baseline plugin wrapped in LIGHTWEIGHT_SQL_MIGRATION, with optional --emit-cmake and --max-lines-per-file for splitting very large baselines across multiple files.
  • Shared CodeGen/SplitFileWriter helper that bin-packs blocks within a per-file line budget; used by CppEmitter and intentionally factored out for reuse.
  • dbtool fold --output FILE [--up-to X] [--dialect D] [--emit-cmake] [--plugin-name N] [--max-lines-per-file N] — output format is picked from the file extension. .sql requires --dialect (sqlite, postgres, mssql, mysql); .cpp is dialect-agnostic. Dispatched before SetupConnectionString since fold never touches a DB; uses a connection-less GetMigrationManagerOffline variant.
  • Unit tests: 10 fold cases (create + altercolumn, drop-table cleanup, chronological ordering, --up-to truncation, RawSql passthrough, column rename FK propagation, release-range filtering, ResolveUpTo parsing) + 4 SplitFileWriter cases + 2 emitter round-trip cases. Green against sqlite3, mssql2022, and postgres.
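As an illustration of packing blocks within a per-file line budget, here is a minimal sketch. It assumes a simple order-preserving greedy strategy; the PR text does not show the algorithm `SplitFileWriter` uses, and `CountLines` and `PackBlocks` are hypothetical names. A budget of zero is treated as "do not split", matching the documented meaning of `maxLinesPerFile`.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Counts newline-terminated lines in a block of generated code.
std::size_t CountLines(std::string const& text)
{
    return static_cast<std::size_t>(std::count(text.begin(), text.end(), '\n'));
}

// Greedy, order-preserving packing: a block goes into the current file unless
// that would exceed the budget, in which case a new file is started.
// A block larger than the budget still gets a file of its own.
std::vector<std::vector<std::string>> PackBlocks(std::vector<std::string> const& blocks,
                                                 std::size_t maxLinesPerFile)
{
    std::vector<std::vector<std::string>> files;
    if (blocks.empty())
        return files;

    files.emplace_back();
    std::size_t used = 0;
    for (auto const& block: blocks)
    {
        auto const lines = CountLines(block);
        bool const overBudget = maxLinesPerFile != 0 && used + lines > maxLinesPerFile;
        if (overBudget && !files.back().empty())
        {
            files.emplace_back();
            used = 0;
        }
        files.back().push_back(block);
        used += lines;
    }
    return files;
}
```

Keeping the blocks in their original order matters for a baseline, because table creation order has to survive the split.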

@christianparpart christianparpart requested a review from a team as a code owner April 30, 2026 05:36
@github-actions github-actions Bot added CLI command line interface tools tests Core API labels Apr 30, 2026
@christianparpart christianparpart force-pushed the feature/dbtool-fold branch 2 times, most recently from 1c305bc to d8bf4e2 Compare April 30, 2026 08:57
@github-actions github-actions Bot added Query Builder Data Binder SQL Data Binder support Query Formatter SQL dialect implementations labels Apr 30, 2026
@christianparpart christianparpart force-pushed the feature/dbtool-fold branch 4 times, most recently from 1bf3c43 to 35bcc7e Compare April 30, 2026 10:50
@github-actions github-actions Bot removed Query Builder Data Binder SQL Data Binder support Query Formatter SQL dialect implementations labels Apr 30, 2026
@christianparpart christianparpart force-pushed the feature/dbtool-fold branch 2 times, most recently from 35cbced to 83bd900 Compare April 30, 2026 17:58

@Yaraslaut Yaraslaut left a comment

Thanks, I left a few small comments, mostly nitpicks.

if (!options.formatter)
    throw std::runtime_error("EmitSqlBaseline: formatter is required");

std::ofstream out(options.outputPath);

Can we build a std::string and write it to the file only once, when everything is done? I am not the biggest fan of << and streams. This would also remove the need for WriteSchemaMigrationsSeed and the other functions to take the output stream as their first argument.
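A minimal sketch of that suggestion, under stated assumptions: the helper names `BuildSchemaMigrationsSeed`, `BuildSqlBaseline`, and `WriteFileOnce` are hypothetical, and the `version` column name is a placeholder, since the real schema and function signatures are not shown in this thread.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns its text instead of taking an output stream.
// The column name "version" is a placeholder, not the real schema.
std::string BuildSchemaMigrationsSeed(std::vector<std::uint64_t> const& timestamps)
{
    std::string sql;
    for (auto const timestamp: timestamps)
        sql += "INSERT INTO schema_migrations (version) VALUES (" + std::to_string(timestamp) + ");\n";
    return sql;
}

// Assembles the whole script in memory from already-formatted statements.
std::string BuildSqlBaseline(std::vector<std::string> const& statements,
                             std::vector<std::uint64_t> const& timestamps)
{
    std::string script;
    for (auto const& statement: statements)
    {
        script += statement;
        script += '\n';
    }
    script += BuildSchemaMigrationsSeed(timestamps);
    return script;
}

// Opens the file, writes the finished script in a single call, and reports failures.
void WriteFileOnce(std::filesystem::path const& path, std::string const& content)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out)
        throw std::runtime_error("cannot open " + path.string());
    out.write(content.data(), static_cast<std::streamsize>(content.size()));
    if (!out.flush())
        throw std::runtime_error("cannot write " + path.string());
}
```

With this shape the builders are plain functions from data to text, so they can be unit-tested without touching the file system, and a failure midway never leaves a half-written baseline behind from the emitter's own logic.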

std::filesystem::path outputPath;
/// Threshold for splitting the body across multiple `.cpp` files. Zero
/// disables splitting and emits a single file.
std::size_t maxLinesPerFile = 5000;

Hm, I guess I am missing something: our migrations are declarative, and if we are folding migrations, then we have only one point of declaration. How can we split this into multiple files?

Member Author

Because that file may become quite big (imagine 500+ tables, many of them with a lot of columns), the single migration TU can be too big for one .cpp file to be compiled on your machine. In my case, it got clang-tidy OOM-killed on my 64GB laptop :)
The approach then is to split the single migration across multiple functions that are invoked from the single folded migration.
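The shape of that split can be sketched as follows. This is an illustration only: `Plan`, `BaselinePart1`, `BaselinePart2`, and `BaselineUp` are hypothetical names, and the sketch leaves out the `LIGHTWEIGHT_SQL_MIGRATION` wrapper because its signature is not shown here. In a real split, each part function would be defined in its own .cpp file and only declared in the file that holds the migration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the migration plan that the baseline fills in.
struct Plan
{
    std::vector<std::string> createdTables;
    void CreateTable(std::string name) { createdTables.push_back(std::move(name)); }
};

// Would live in its own translation unit, e.g. a first generated part file.
void BaselinePart1(Plan& plan)
{
    plan.CreateTable("users");
    plan.CreateTable("orders");
}

// Would live in a second translation unit.
void BaselinePart2(Plan& plan)
{
    plan.CreateTable("invoices");
}

// The single folded migration: still one point of declaration,
// it only delegates to the parts in a fixed order.
void BaselineUp(Plan& plan)
{
    BaselinePart1(plan);
    BaselinePart2(plan);
}
```

The migration stays a single declarative unit from the manager's point of view; only the compiler sees several smaller translation units.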

Comment on lines +73 to +101
using T = std::decay_t<decltype(t)>;
if constexpr (std::is_same_v<T, Bigint>)
    return std::string(kPrefix) + "Bigint {}";
else if constexpr (std::is_same_v<T, Bool>)
    return std::string(kPrefix) + "Bool {}";
else if constexpr (std::is_same_v<T, Date>)
    return std::string(kPrefix) + "Date {}";
else if constexpr (std::is_same_v<T, DateTime>)
    return std::string(kPrefix) + "DateTime {}";
else if constexpr (std::is_same_v<T, Guid>)
    return std::string(kPrefix) + "Guid {}";
else if constexpr (std::is_same_v<T, Integer>)
    return std::string(kPrefix) + "Integer {}";
else if constexpr (std::is_same_v<T, Real>)
    return std::format("{}Real {{ {} }}", kPrefix, t.precision);
else if constexpr (std::is_same_v<T, Smallint>)
    return std::string(kPrefix) + "Smallint {}";
else if constexpr (std::is_same_v<T, Tinyint>)
    return std::string(kPrefix) + "Tinyint {}";
else if constexpr (std::is_same_v<T, Time>)
    return std::string(kPrefix) + "Time {}";
else if constexpr (std::is_same_v<T, Timestamp>)
    return std::string(kPrefix) + "Timestamp {}";
else if constexpr (std::is_same_v<T, Char>)
    return std::format("{}Char {{ {} }}", kPrefix, t.size);
else if constexpr (std::is_same_v<T, NChar>)
    return std::format("{}NChar {{ {} }}", kPrefix, t.size);
else if constexpr (std::is_same_v<T, Varchar>)
    return std::format("{}Varchar {{ {} }}", kPrefix, t.size);

I think it is easier here to use an overloaded visitor instead of one lambda.
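For reference, the overloaded-visitor idiom looks roughly like this. The type set is cut down to four hypothetical stand-ins for the real column types, `EmitColumnType` is an invented name, and the `"Types::"` prefix is a placeholder for `kPrefix`, whose value is not shown in the diff.

```cpp
#include <cstddef>
#include <string>
#include <variant>

// Hypothetical stand-ins for a few of the column type alternatives.
struct Bigint {};
struct Bool {};
struct Real { std::size_t precision; };
struct Varchar { std::size_t size; };
using ColumnType = std::variant<Bigint, Bool, Real, Varchar>;

// The usual helper that merges several lambdas into one visitor.
template <typename... Ts>
struct overloaded : Ts...
{
    using Ts::operator()...;
};
template <typename... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

std::string EmitColumnType(ColumnType const& type)
{
    std::string const prefix = "Types::"; // placeholder for kPrefix
    return std::visit(overloaded {
        [&](Bigint const&) { return prefix + "Bigint {}"; },
        [&](Bool const&) { return prefix + "Bool {}"; },
        [&](Real const& t) { return prefix + "Real { " + std::to_string(t.precision) + " }"; },
        [&](Varchar const& t) { return prefix + "Varchar { " + std::to_string(t.size) + " }"; },
    }, type);
}
```

One practical difference from the `if constexpr` chain: if a new alternative is added to the variant and no overload handles it, `std::visit` fails to compile instead of silently falling through.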

// timestamp slot. (For the LUP plugin the typical baseline body is well under
// any reasonable threshold, so this rarely fires.)
auto const body = BuildSingleFileBody(fold);
std::ofstream out(options.outputPath);

Here as well, I think it is cleaner to build the strings first and then flush everything at once.

dbtool fold --output FILE emits a self-contained baseline (.cpp plugin
or .sql script) that reproduces the post-migration state from an empty
DB. .sql output requires --dialect (sqlite, postgres, mssql, mysql);
.cpp output is dialect-agnostic. Runs without any DB connection - loads
plugins, walks migrations in memory, writes a file.

Built on a new pure plan-walk primitive
MigrationManager::FoldRegisteredMigrations(formatter, upToInclusive)
that folds every registered migration into a per-table view of the
final shape plus a chronological list of data steps, indexes, and
releases.

The fold module (src/Lightweight/MigrationFold/{Folder,CppEmitter,
SqlEmitter}.{hpp,cpp}) emits via the existing ToSql() formatter path so
each dialect's CREATE TABLE / CREATE INDEX / INSERT codegen stays the
single source of truth. The .cpp emitter wraps the body in
LIGHTWEIGHT_SQL_MIGRATION; the .sql emitter additionally emits CREATE
TABLE schema_migrations and a stamping INSERT for every folded
timestamp so the post-fold DB looks identical to a real apply-all run.

Also pulls in CodeGen/SplitFileWriter shared codegen helper used by the
.cpp emitter to bin-pack large baselines across multiple files.

Tests: fold unit tests cover create/altercolumn/drop-table cleanup,
data-step chronological order, --up-to truncation, RawSql passthrough,
column rename FK propagation, release-range filtering, ResolveUpTo
parsing. SqlEmitter/CppEmitter round-trip tests verify the emitted
artifacts match the expected shape. SplitFileWriter tests cover bin-
packing, single-chunk, zero-budget, and oversize-block boundaries.

All [Fold] and [SplitFileWriter] tests pass against sqlite3,
mssql2022, and postgres. Full SqlMigration suite (44 cases / 210
assertions) green on all three.

Signed-off-by: Christian Parpart <christian@parpart.family>