
Multithreaded replication WIP #1454


Draft · wants to merge 67 commits into master

Conversation

meiji163 (Contributor) commented Oct 4, 2024

Description

This PR introduces multi-threaded replication for applying DML queries to the ghost table. The goal is to be able to migrate tables with a high rate of DML queries (e.g. >5k rows/s). Currently gh-ost lags behind in these situations, taking a very long time to complete or not completing at all.

Similar to MySQL replication threads, gh-ost streams binlog events from the source and groups them into transactions. It then submits the transactions to a pool of workers, which apply them concurrently on the ghost table. We ensure that dependent transactions are applied in a consistent order (equivalent to MySQL multi-threaded replication with replica_parallel_type=LOGICAL_CLOCK and replica_preserve_commit_order=0).

With WRITESET enabled on the source, this allows a high degree of parallelism in the transaction applier.
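
To make the shape of the design concrete, here is a minimal sketch of the applier worker pool. The type and function names are hypothetical, invented for illustration; this is not the code in this PR, only the general idea of fanning grouped transactions out to a fixed number of applier goroutines.

package main

import (
    "fmt"
    "sync"
)

// transaction is a hypothetical stand-in for a group of row events that
// committed together on the source, tagged with the logical-clock metadata
// the source writes into the binlog.
type transaction struct {
    sequenceNumber int64
    lastCommitted  int64
    apply          func() error // applies the DML to the ghost table
}

// applyConcurrently fans transactions out to nWorkers applier goroutines.
// Dependency ordering is enforced by the coordinator before a transaction is
// put on the channel; this sketch only shows the worker-pool shape.
func applyConcurrently(txs <-chan *transaction, nWorkers int) error {
    var wg sync.WaitGroup
    errCh := make(chan error, nWorkers)
    for i := 0; i < nWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for tx := range txs {
                if err := tx.apply(); err != nil {
                    errCh <- err
                    return
                }
            }
        }()
    }
    wg.Wait()
    close(errCh)
    return <-errCh // nil when no worker reported an error
}

func main() {
    txs := make(chan *transaction, 8)
    go func() {
        defer close(txs)
        for i := int64(1); i <= 8; i++ {
            seq := i
            txs <- &transaction{
                sequenceNumber: seq,
                lastCommitted:  seq - 1,
                apply: func() error {
                    fmt.Println("applied transaction", seq)
                    return nil
                },
            }
        }
    }()
    if err := applyConcurrently(txs, 4); err != nil {
        fmt.Println("apply error:", err)
    }
}

Everything interesting happens before a transaction reaches that channel: the coordinator must decide when it is safe to hand a transaction to a worker, which is where the logical-clock dependency check discussed later in this thread comes in.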

Changes

TODO

Performance tests

TODO

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

mhamza15 previously approved these changes Apr 1, 2025
* Remove error return value since we don't use it.

* Lock the mutex whenever we plan to update the low watermark to avoid a race condition.

* Check for data races in our unit tests.

* Still return an error from ProcessEventsUntilDrained but actually check it in our code.

* Make coordinator_test.go check the err from ProcessEventsUntilDrained again

* Remove unreachable return in ProcessEventsUntilDrained
hugodorea previously approved these changes Apr 9, 2025
…ark (#1531)

* Notify waiting channels on completed transaction, not just the watermark.

* Add checksum validation to coordinator test

* Use errgroup to perform transactions concurrently in coordinator_test.go

* Configure concurrency separate from total number of transactions.

* Run similar number of txs to previous test and ignore context.

* Have at least 1 child in a transaction.

* Notify waiting channels for the current sequence number.
hugodorea previously approved these changes Apr 10, 2025
meiji163 (Contributor, Author) commented Apr 12, 2025

Despite promising performance results in testing, we stopped developing this branch in Nov 2024 after running into intermittent data inconsistency problems in internal replica tests. I believe I've tracked down the source of the issue. Below is the investigation for anyone interested.

Investigation

The data inconsistency appeared intermittently on several different ghost testing replicas running the MTR version. I was able to reproduce the error locally using a docker localtest with a sysbench write load; the inconsistency shows up fairly reliably at ~900 trx/sec on the table.

The test table looks like this:

CREATE TABLE `sbtest1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
);

The test result is a checksum mismatch (usually only one row); in this case the row with id=5025, where the second column k is wrong. I ran the test with general_log='ON' on the primary and the replica to see what is going on. Here are the last two transactions on sbtest1 from sysbench affecting row 5025:

-- 2025-04-12T03:28:44.209273Z        25 Execute
BEGIN;
DELETE FROM sbtest1 WHERE id=5025;
INSERT INTO sbtest1 (id, k, c, pad) VALUES (5025, 5046, '55585975399-51936995975-90609908571-88981758242-41639509045-49015163211-63909390173-09873895014-17528416149-59787710722', '90699347551-90936038435-69760642136-45340328341-67205199431');
COMMIT;

-- 2025-04-12T03:28:44.209695Z        28 Execute
UPDATE sbtest1 SET k=k+1 WHERE id=5025;

And the corresponding binlog events on the replica:

last_committed=89053    sequence_number=89058
DELETE FROM `test`.`sbtest1`
WHERE
  @1=5025
  @2=4993
  @3='72162371109-65711525437-30164254657-02236716337-47638530925-52423543892-06270192544-11372615750-04017656641-19388264173'
  @4='44029122667-48848103638-83352868135-91599152925-97809617080'
# at 145248382

INSERT INTO `test`.`sbtest1`
SET
  @1=5025
  @2=5046
  @3='55585975399-51936995975-90609908571-88981758242-41639509045-49015163211-63909390173-09873895014-17528416149-59787710722'
  @4='90699347551-90936038435-69760642136-45340328341-67205199431'
# at 145248672
...

last_committed=89062    sequence_number=89065
UPDATE `test`.`sbtest1`
WHERE
  @1=5025
  @2=5046
  @3='55585975399-51936995975-90609908571-88981758242-41639509045-49015163211-63909390173-09873895014-17528416149-59787710722'
  @4='90699347551-90936038435-69760642136-45340328341-67205199431'
SET
  @1=5025
  @2=5047
  @3='55585975399-51936995975-90609908571-88981758242-41639509045-49015163211-63909390173-09873895014-17528416149-59787710722'
  @4='90699347551-90936038435-69760642136-45340328341-67205199431'
# at 145258107

So the correct final value is k=5047. But on _sbtest1_gho the order of the transactions is switched: gh-ost first updates the row to k=5047, then deletes and reinserts it with k=5046, resulting in the wrong final value k=5046.

We can look at the dependency (sub)graph for the original transactions on sbtest1 (sequence numbers 89058 and 89065). It looks like this:

graph LR;
     89058--> 89053;
     89053--> 89050;
     89065--> 89062;
     89062--> 89053;

This means that once transactions 89053 and 89062 finish, the coordinator may schedule 89058 and 89065 concurrently.

Comparing with what the MySQL replication applier coordinator does (sql/rpl_rli_pdb.cc), I realized that a transaction should be scheduled if and only if lowWaterMark >= lastCommitted.

The lastCommitted of a transaction is the sequence number of the most recent (i.e. greatest) transaction it depends on. The lowWaterMark is a global variable that maintains the invariant that all transactions with sequence numbers <= lowWaterMark are complete. Therefore, if we schedule a transaction while lastCommitted > lowWaterMark, it may still have dependencies that haven't completed, even though the transaction numbered lastCommitted itself is complete.

In the example, 89065 must wait until lowWaterMark >= 89062, at which point it's guaranteed that 89058 completed.
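
A minimal sketch of that invariant, with a tracker type and names invented here for illustration (not code from this branch): the watermark only advances over a contiguous run of completed sequence numbers, and a transaction becomes schedulable only once the watermark has reached its lastCommitted.

package main

import "sync"

// lowWatermarkTracker is a hypothetical illustration of the invariant above:
// lowWaterMark is the greatest S such that every transaction with
// sequence number <= S has completed.
type lowWatermarkTracker struct {
    mu           sync.Mutex
    lowWaterMark int64
    completed    map[int64]bool
}

func newLowWatermarkTracker(start int64) *lowWatermarkTracker {
    return &lowWatermarkTracker{lowWaterMark: start, completed: make(map[int64]bool)}
}

// markComplete records that a transaction finished, then advances the
// watermark across any contiguous run of completed sequence numbers.
func (t *lowWatermarkTracker) markComplete(seq int64) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.completed[seq] = true
    for t.completed[t.lowWaterMark+1] {
        t.lowWaterMark++
        delete(t.completed, t.lowWaterMark)
    }
}

// canSchedule is the rule from sql/rpl_rli_pdb.cc: start a transaction only
// once lowWaterMark >= lastCommitted, so that everything it could conflict
// with has already been applied.
func (t *lowWatermarkTracker) canSchedule(lastCommitted int64) bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    return t.lowWaterMark >= lastCommitted
}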

Fix

In our Coordinator, the culprit is this check in WaitForTransaction:

if _, ok := c.completedJobs[lastCommitted]; ok {
    return nil
}

In the example it allowed 89065 to be applied after 89062, but before 89058 completed.

After removing these lines, the sysbench localtest passes consistently.
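
For completeness, here is a hedged sketch of what the wait condition then reduces to (again with invented names, not the exact code in this branch): the only exit condition is that the low watermark has reached lastCommitted.

package main

import "sync"

// coordinatorSketch is a hypothetical, stripped-down stand-in for the PR's
// Coordinator, reduced to the state needed for the wait condition.
type coordinatorSketch struct {
    mu           sync.Mutex
    cond         *sync.Cond
    lowWaterMark int64
}

func newCoordinatorSketch() *coordinatorSketch {
    c := &coordinatorSketch{}
    c.cond = sync.NewCond(&c.mu)
    return c
}

// WaitForTransaction blocks until lowWaterMark >= lastCommitted. There is no
// completedJobs shortcut: completion of the single transaction numbered
// lastCommitted is not sufficient on its own.
func (c *coordinatorSketch) WaitForTransaction(lastCommitted int64) {
    c.mu.Lock()
    defer c.mu.Unlock()
    for c.lowWaterMark < lastCommitted {
        c.cond.Wait()
    }
}

// advanceWatermark is called as applied transactions move the low watermark
// forward; it wakes every waiter so they can re-check their condition.
func (c *coordinatorSketch) advanceWatermark(newLowWaterMark int64) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if newLowWaterMark > c.lowWaterMark {
        c.lowWaterMark = newLowWaterMark
        c.cond.Broadcast()
    }
}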
