
Multithreaded replication WIP#1454

Draft
meiji163 wants to merge 67 commits into master from meiji163/parallel-repl

Conversation


@meiji163 meiji163 commented Oct 4, 2024

Description

This PR introduces multi-threaded replication for applying DML queries to the ghost table. The goal is to be able to migrate tables with a high rate of DML queries (e.g. >5k rows/s). Currently, gh-ost lags behind in these situations, taking a very long time to complete or not completing at all.

Similar to MySQL replication threads, gh-ost streams binlog events from the source and groups them into transactions. It then submits the transactions to a pool of workers, which apply them concurrently to the ghost table. We ensure that dependent transactions are applied in a consistent order (equivalent to MySQL multi-threaded replication with replica_parallel_type=LOGICAL_CLOCK and replica_preserve_commit_order=0).

With WRITESET enabled on the source, this enables a great amount of parallelism in the transaction applier.

Changes

TODO

Performance tests

TODO

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

* Remove error return value since we don't use it.

* Lock the mutex whenever we plan to update the low watermark to avoid a race condition.

* Check for data races in our unit tests.

* Still return an error from ProcessEventsUntilDrained but actually check it in our code.

* Make coordinator_test.go check the err from ProcessEventsUntilDrained again

* Remove unreachable return in ProcessEventsUntilDrained
hugodorea previously approved these changes Apr 9, 2025
…ark (#1531)

* Notify waiting channels on completed transaction, not just the watermark.

* Add checksum validation to coordinator test

* Use errgroup to perform transactions concurrently in coordinator_test.go

* Configure concurrency separate from total number of transactions.

* Run similar number of txs to previous test and ignore context.

* Have at least 1 child in a transaction.

* Notify waiting channels for the current sequence number.
hugodorea previously approved these changes Apr 10, 2025

meiji163 commented Apr 12, 2025

Despite promising performance results in testing, we stopped developing this branch in Nov 2024 after running into intermittent data inconsistency problems in internal replica tests. I believe I've tracked down the source of this issue. Below is the investigation, for anyone interested.

Investigation

The data inconsistency appeared intermittently on several different ghost testing replicas running the MTR version.
I was able to reproduce the error locally using a docker localtest with sysbench write load. The data inconsistency happens pretty reliably with ~900 trx/sec on the table.

The test table looks like this:

CREATE TABLE `sbtest1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
);

The test result is a checksum mismatch, usually on only one row; in this case, the row with id=5025 has the wrong value in the second column k. I ran the test with general_log='ON' on the primary and replica to see what was going on. Here are the last two transactions on sbtest1 from sysbench affecting row 5025:

-- 2025-04-12T03:28:44.209273Z        25 Execute
BEGIN;
DELETE FROM sbtest1 WHERE id=5025;
INSERT INTO sbtest1 (id, k, c, pad) VALUES (5025, 5046, '55585975399-51936995975-90609908571-88981758242-41639509045-49015163211-63909390173-09873895014-17528416149-59787710722', '90699347551-90936038435-69760642136-45340328341-67205199431');
COMMIT;

-- 2025-04-12T03:28:44.209695Z        28 Execute
UPDATE sbtest1 SET k=k+1 WHERE id=5025;

And the corresponding binlog events on the replica:

last_committed=89053    sequence_number=89058
DELETE FROM `test`.`sbtest1`
WHERE
  @1=5025
  @2=4993
  @3='72162371109-65711525437-30164254657-02236716337-47638530925-52423543892-06270192544-11372615750-04017656641-19388264173'
  @4='44029122667-48848103638-83352868135-91599152925-97809617080'
# at 145248382

INSERT INTO `test`.`sbtest1`
SET
  @1=5025
  @2=5046
  @3='55585975399-51936995975-90609908571-88981758242-41639509045-49015163211-63909390173-09873895014-17528416149-59787710722'
  @4='90699347551-90936038435-69760642136-45340328341-67205199431'
# at 145248672
...

last_committed=89062    sequence_number=89065
UPDATE `test`.`sbtest1`
WHERE
  @1=5025
  @2=5046
  @3='55585975399-51936995975-90609908571-88981758242-41639509045-49015163211-63909390173-09873895014-17528416149-59787710722'
  @4='90699347551-90936038435-69760642136-45340328341-67205199431'
SET
  @1=5025
  @2=5047
  @3='55585975399-51936995975-90609908571-88981758242-41639509045-49015163211-63909390173-09873895014-17528416149-59787710722'
  @4='90699347551-90936038435-69760642136-45340328341-67205199431'
# at 145258107

So, the correct value is k=5047. But on _sbtest1_gho the order of the transactions is switched. First gh-ost updates the row to k=5047 then deletes and reinserts it with k=5046, resulting in wrong final value k=5046.

We can look at the dependency (sub)graph for the original transactions on sbtest1 (sequence numbers 89058 and 89065). It looks like this:

graph LR;
     89058--> 89053;
     89053--> 89050;
     89065--> 89062;
     89062--> 89053;

This means once transaction 89053 and 89062 finish, the coordinator may schedule 89058 and 89065 concurrently.

Comparing with what the MySQL replication applier coordinator does (sql/rpl_rli_pdb.cc), I realized that a transaction should be scheduled if and only if lowWaterMark >= lastCommitted.

The lastCommitted of a transaction is the sequence number of the most recent (i.e. greatest) transaction it depends on. The lowWaterMark is a global variable that maintains the invariant that all sequence numbers <= lowWaterMark are complete. Therefore, if we schedule a transaction when lastCommitted > lowWaterMark, it may still have incomplete dependencies, even if the transaction numbered lastCommitted itself is complete.

In the example, 89065 must wait until lowWaterMark >= 89062, at which point it's guaranteed that 89058 completed.

Fix

In our Coordinator, the culprit is this line in WaitForTransaction:

if _, ok := c.completedJobs[lastCommitted]; ok {
	return nil
}

In the example it allowed 89065 to be applied after 89062, but before 89058 completed.

After removing these lines, the sysbench localtest passes consistently.

@TomKnaepen

Hi @meiji163, my team has been running into the exact limits you describe in this PR, so we were very excited to see the progress already made here. However, over the last 2 weeks I've run some tests with this PR and still notice a data inconsistency issue.

We use gh-ost on AWS Aurora, and my testing has been with a production-like setup. On the table we're trying to migrate the only interesting queries (not SELECTs) are fairly simple: we insert rows and update some of them fairly quickly to remove a timestamp value.

The inconsistency I see is caused by some UPDATEs not being applied to the ghost table. This happens infrequently but consistently throughout the tests (50 rows affected out of ~2 million).
Here's an example problematic row:

#250516 13:06:46 server id 1951482084  end_log_pos 26539240 CRC32 0x364830a9   Anonymous_GTID  last_committed=1822 sequence_number=1824    rbr_only=yes    original_committed_timestamp=1747393606597218   immediate_commit_timestamp=1747393606597218 transaction_length=3944
# at 26539538
### INSERT INTO `DB_ACTIVE`.`message`
### SET
###   @1=39181433

--

#250516 13:06:46 server id 1951482084  end_log_pos 26589750 CRC32 0x2b5cf4fe   Anonymous_GTID  last_committed=1867 sequence_number=1868    rbr_only=yes    original_committed_timestamp=1747393606648774   immediate_commit_timestamp=1747393606648774 transaction_length=7447
# at 26590057
### UPDATE `DB_ACTIVE`.`message`
### WHERE
###   @1=39181433
###   @52=1747480006593
### SET
###   @1=39181433
###   @52=NULL

--

#250516 13:06:46 server id 1951482084  end_log_pos 28141453 CRC32 0xd8e3f39c   Anonymous_GTID  last_committed=2029 sequence_number=2030    rbr_only=yes    original_committed_timestamp=1747393606747371   immediate_commit_timestamp=1747393606747371 transaction_length=3949
# at 28141756
### INSERT INTO `DB_ACTIVE`.`_message_gho`
### SET
###   @1=39181433

It doesn't quite look like the same issue you ran into and described above because the dependency graph looked fairly linear:

last_committed=1822 sequence_number=1824 -- INSERT
last_committed=1824 sequence_number=1825
last_committed=1825 sequence_number=1826
last_committed=1826 sequence_number=1828
last_committed=1828 sequence_number=1832
last_committed=1832 sequence_number=1836
last_committed=1836 sequence_number=1837
last_committed=1837 sequence_number=1840
last_committed=1840 sequence_number=1844
last_committed=1844 sequence_number=1854
last_committed=1854 sequence_number=1856
last_committed=1856 sequence_number=1857
last_committed=1857 sequence_number=1858
last_committed=1858 sequence_number=1862
last_committed=1862 sequence_number=1863
last_committed=1863 sequence_number=1866
last_committed=1866 sequence_number=1867
last_committed=1867 sequence_number=1868 -- UPDATE

Do you have any ideas what could be the issue, and is there anything else I can provide?

Thank you!

@meiji163

Thanks for reporting this @TomKnaepen, I also ran into more data consistency issues after the latest changes, but I haven't had time to investigate further. We thought it might be related to the binlogsyncer connection being closed and reopened. Is there any error in your gh-ost output?

@TomKnaepen

I don't recall seeing any errors at the time; I ran some new tests on a smaller scale this week to see if I could find anything, but there are no errors or anything noteworthy in the gh-ost output.
I tested both revisions 2eacc78 and 7c3032f, since I thought you might be referring to that commit, but both resulted in inconsistent data.

@dnovitski

Hi @meiji163 @arthurschreiber @danieljoos @hugodorea @mhamza15 @TomKnaepen 👋

We have used GitHub Copilot CLI with 2 collaborating agents (Claude Opus 4.6 "Sherlock" and GPT 5.5 "Otter") to investigate and fix the data inconsistency issue.

They found the issue and a PR with the fix can be found at: dnovitski#1


Root Cause Analysis

The Core Bug: Binlog Rotation Resets Sequence Numbers But Coordinator State Is Never Reset

MySQL's logical clock (last_committed, sequence_number) used for parallel replication is per-binlog-file. When max_binlog_size triggers a binlog rotation, sequence_number resets back to 1. However, the coordinator's lowWaterMark (lwm) was never reset — it retained the previous file's high value (e.g., 65553).

After rotation, all WaitForTransaction(lastCommitted) checks passed immediately because lwm >= lastCommitted was trivially true for the new file's small sequence numbers. This caused transactions that should have waited for their dependencies to execute concurrently, resulting in out-of-order application and wrong final values.

Concrete example from our debugging:

Before rotation: lwm = 65553, sequence numbers ~65000+
After rotation:  sequence numbers restart at 1, 2, 3, ...
Transaction seq=7 with lastCommitted=5:
  Check: 65553 >= 5 → TRUE → executes immediately (WRONG! seq=5 may not have completed yet in the new file)

This is exactly what the user observed: two transactions affecting id=5025 executed in wrong order, resulting in k=5046 instead of the correct k=5047.

The dependency graph showed that once their respective lastCommitted transactions completed, the coordinator scheduled both concurrently — but one depended on the other through an intermediate transaction that hadn't completed yet, because the stale lwm bypassed the wait.

Additional Bugs Found and Fixed

Bug 2 — Silent Error Swallowing in DML Apply:
The original applyDMLEvents() logged errors but silently discarded them, calling MarkTransactionCompleted regardless. This corrupted the dependency tracking — the coordinator believed a transaction completed successfully when it actually failed, potentially allowing dependent transactions to proceed with stale data.

Fix: Retry InnoDB deadlocks (error 1213) and lock wait timeouts (error 1205) with jittered exponential backoff (up to 100 retries, matching MySQL's slave_transaction_retries behavior). Propagate fatal (non-retryable) errors via a broadcast channel (failedCh) so all workers and the coordinator can shut down cleanly.

Bug 3 — Wait Channel Deadlock on Error Paths:
WaitForTransaction used unbuffered channels. If a waiter exited early via failedCh (due to another worker's fatal error), the subsequent MarkTransactionCompleted send would block forever, hanging the coordinator.

Fix: Use buffered channels (capacity 1) so the notification send never blocks even if no one is listening.

Bug 4 — Data Race on lwm Read:
The if c.lowWaterMark >= 0 check in the RotateEvent handler was read without holding the mutex, racing with concurrent MarkTransactionCompleted calls that update lwm.

Fix: Guard the read with c.mu.Lock()/c.mu.Unlock().

Note: Pre-existing Bug in buildDMLEventQuery

The second agent (Otter) also discovered that buildDMLEventQuery in applier.go mutates dmlEvent.DML for unique-key UPDATE operations (sets to DeleteDML then InsertDML, never restores the original value). This doesn't affect sysbench workloads (PK-only operations) but could cause issues with workloads that modify unique keys. This is not addressed in our fix PR but is worth noting for a separate fix.

Verification

  • 20+ consistency test iterations passed at rate=1200 trx/s, 4 parallel workers, 90-second sysbench write load (previously showed ~60% failure rate)
  • Independent verification by the second agent: 48-minute continuous stress test with 10 binlog rotations and 0 data mismatches
  • Both agents reached independent consensus on the root cause and fix before confirming

Performance: Multithreaded Replication vs Baseline

We benchmarked MTR against the baseline (non-MTR) gh-ost using 200K rows with 1000 trx/s sysbench write load for 90 seconds:

| Configuration | Total Migration Time | DML Events/s (during load) | Row-copy starvation period |
| --- | --- | --- | --- |
| Baseline (no MTR, single-threaded) | 166s | 1,463/s | ~150s (0 rows copied during active load!) |
| MTR, 4 workers | 135s | 2,049/s | ~120s |
| MTR, 4 workers, batch=50 | 150s | 2,156/s | ~135s |

Key findings:

  • MTR provides ~19% improvement in total migration time with 4 workers
  • DML processing throughput increases by ~40% (2,049 vs 1,463 events/s)
  • The fundamental bottleneck remains executeWriteFuncs which calls ProcessEventsUntilDrained() before each row-copy chunk — under high write load, the event queue fills continuously and row-copy gets starved. MTR helps by draining the queue faster with parallel workers, giving row-copy more turns to execute.
  • InnoDB deadlocks from parallel workers are expected and handled correctly with retry+backoff (gap locks on secondary indexes cause contention between parallel REPLACE INTO / DELETE FROM operations on the ghost table)
