Troubleshooting Common Issues in Galera DB Clusters

Mydbops
Mar 6, 2026

High availability isn’t optional anymore. For telecom platforms, financial systems, and enterprise workloads, minutes of downtime can impact SLAs, revenue streams, and customer confidence.

MariaDB Galera Cluster addresses this with synchronous, multi-master replication: every node stays in sync, handles read/write traffic, and keeps your data consistent even when something goes wrong.

This guide covers how Galera works under the hood, the most common issues you'll encounter in real environments, and the exact logs and errors you'll see when troubleshooting them.

MariaDB Galera Cluster

MariaDB Galera Cluster is a synchronous multi-master replication system. Unlike traditional Source-Replica setups, every node can handle both reads and writes simultaneously.

Read: Setting Up MariaDB Galera Cluster from Scratch

Why use Galera?

Galera cluster capabilities:

  • No single point of failure: all nodes are active and can serve traffic simultaneously, so the architecture remains robust even if individual components fail.
  • Automatic failover: if one node dies, the cluster detects it and routes traffic to the remaining nodes without downtime.
  • Synchronous consistency: all nodes hold the same data at the same time via Write Set Replication, guaranteeing data integrity.
  • Simple scaling: add a new node and it joins the cluster, syncs automatically, and starts serving traffic.

Instead of asynchronous binlog-based replication, Galera uses Write Set Replication (wsrep). When a write happens on one node, the transaction is broadcast to all nodes and only committed once every node confirms it can apply it, ensuring consistency across the cluster.

Example flow:

Write on Node 1 → Certified by Node 2 & Node 3 → Committed on all nodes

(Diagram: a write on the primary produces a write-set that is broadcast to the receiver nodes, passes the certification test, is queued into the received queue, and is applied by the applier and committer threads before commit.)
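This certify-then-commit flow can be illustrated with a tiny simulation (a simplified model for intuition only, not Galera's actual algorithm): each write-set records the keys it touches, and certification fails when one of those keys was already modified by a write-set certified after the transaction's snapshot.

```python
# Simplified sketch of optimistic certification. A write-set is rejected
# when a key it touches was modified after the transaction's snapshot.
# This is an illustration, not the real Galera implementation.

class CertificationIndex:
    def __init__(self):
        self.last_seqno = 0
        self.certified = {}  # key -> seqno of the write-set that last modified it

    def certify(self, keys, start_seqno):
        """Certify a write-set whose transaction started at snapshot start_seqno."""
        for key in keys:
            if self.certified.get(key, 0) > start_seqno:
                return False  # certification failure -> transaction rolled back
        self.last_seqno += 1
        for key in keys:
            self.certified[key] = self.last_seqno
        return True

idx = CertificationIndex()
snapshot = idx.last_seqno                  # two transactions start concurrently
print(idx.certify({"row:99"}, snapshot))   # first writer certifies -> True
print(idx.certify({"row:99"}, snapshot))   # concurrent writer conflicts -> False
```

The first writer wins; the second is aborted and must be retried by the application, which is exactly the behavior behind the certification errors covered later.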

Status check:

SHOW STATUS LIKE 'wsrep_%';

For the full list of status variables, see the official reference.

Key variables to watch:

  • wsrep_cluster_status: should be Primary
  • wsrep_cluster_size: number of nodes currently in the cluster
  • wsrep_local_state_comment: should be Synced
  • wsrep_local_recv_queue: a persistently large value indicates a lagging node
  • wsrep_flow_control_paused: fraction of time replication was paused by flow control

Galera Cluster Architecture

A typical MariaDB Galera Cluster consists of:

  • Multiple MariaDB nodes (mysqld) — each running with Galera support enabled
  • The wsrep API — the replication interface between MariaDB and Galera
  • Galera Provider (libgalera_smm.so) — handles node communication and write-set certification

Core configuration (/etc/my.cnf.d/server.cnf):

[mysqld]
binlog_format=ROW
default_storage_engine=InnoDB
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name="mariadb_cluster"
wsrep_cluster_address="gcomm://node1,node2,node3"
wsrep_node_address="10.0.0.1"
wsrep_node_name="node1"
wsrep_sst_method=rsync
wsrep_sst_auth="sst_user:sst_pass"

Cluster health check:

SHOW STATUS LIKE 'wsrep_cluster%';

Expected healthy output:

wsrep_cluster_status = Primary
wsrep_cluster_size   = 3
wsrep_local_state_comment = Synced
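That check is easy to script. The sketch below parses the tab-separated "name value" lines printed by the mysql client (the three-node expected size is an assumption; adjust for your cluster):

```python
# Sketch: decide whether a node looks healthy from the output of
# `mysql -Ne "SHOW STATUS LIKE 'wsrep_%'"` (tab-separated name/value lines).

EXPECTED_CLUSTER_SIZE = 3  # assumption: a three-node cluster

def parse_wsrep_status(text):
    status = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition("\t")
        status[name] = value
    return status

def is_healthy(status, expected_size=EXPECTED_CLUSTER_SIZE):
    return (status.get("wsrep_cluster_status") == "Primary"
            and status.get("wsrep_local_state_comment") == "Synced"
            and int(status.get("wsrep_cluster_size", 0)) == expected_size)

sample = """wsrep_cluster_status\tPrimary
wsrep_cluster_size\t3
wsrep_local_state_comment\tSynced"""
print(is_healthy(parse_wsrep_status(sample)))  # True
```

Wiring this into your monitoring gives an immediate alert when a node drops out of the Primary component or falls behind.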

Log example - Node joining successfully:

Nov 19 10:43:12 node2 mysqld[2865]: WSREP: Member 1.2 (node2) synced with group.
Nov 19 10:43:12 node2 mysqld[2865]: WSREP: Synchronized with group, ready for connections.

Common Startup and Node Join Issues

Starting a new node or bringing up the cluster for the first time can fail for multiple reasons.

Case 1: Node Fails to Join the Cluster

Symptom: Node stays stuck in JOINING or NON-PRIMARY state.

Log:

WSREP: Failed to open TCP connection to peer node1:4567
WSREP: Member 0.0 (node2) requested state transfer from '*any*'; failed

Fix:

  • Allow ports 4567 (replication), 4568 (IST), and 4444 (SST) through the firewall.
  • Verify wsrep_cluster_address and node IPs.
  • Confirm SST user credentials are consistent across all nodes.
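A quick way to confirm those ports are actually reachable from a joining node is a plain TCP probe, sketched here in Python (the 127.0.0.1 address is a placeholder; point it at your cluster nodes):

```python
# Sketch: TCP reachability check for the Galera ports
# (4567 replication, 4568 IST, 4444 SST).
import socket

GALERA_PORTS = {4567: "replication", 4568: "IST", 4444: "SST"}

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

host = "127.0.0.1"  # placeholder: replace with a cluster node's address
for port, role in GALERA_PORTS.items():
    state = "open" if port_open(host, port) else "blocked"
    print(f"{host}:{port} ({role}): {state}")
```

Run it from the joiner toward each donor candidate; anything reported blocked points at the firewall or NAT rules mentioned above.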

Case 2: SST (State Snapshot Transfer) Failure

Log:

WSREP: SST failed: exit code 32
WSREP: State transfer required but no donor available

Fix:

  • Ensure the donor node has enough disk space.
  • Verify SST user privileges:
GRANT RELOAD, LOCK TABLES, PROCESS, SUPER ON *.* TO 'sst_user'@'%' IDENTIFIED BY 'sst_pass';

Case 3: Bootstrap Errors

Log:

WSREP: Failed to determine cluster address from configuration

Bootstrap the first node with:

galera_new_cluster

Or set an empty cluster address on the first node and start it:

wsrep_cluster_address="gcomm://"

Then start the remaining nodes normally:

systemctl start mariadb

Replication and Synchronization Problems

Even with synchronous replication, nodes can fall behind or desync temporarily.

Flow Control Active

Log:

WSREP: Flow control paused, waiting for a slow node.
WSREP: Paused writes for 1.2s due to high send queue.

What it means: One node is too slow; the others are waiting for it to catch up.

Fix:

  • Check CPU and disk I/O on the lagging node.
  • Investigate bottlenecks before they cascade.

Certification Failures

Log:

WSREP: Transaction failed due to certification error
WSREP: Aborting transaction (conflict detected)

What it means: Two nodes attempted to write to the same row at the same time.

Fix: Add retry logic in the application layer, reduce concurrent writes to the same rows, and ensure unique key constraints are in place.

Frequent Full SSTs Instead of IST

Log:

WSREP: IST not possible, falling back to SST

Fix: Increase GCache size:

wsrep_provider_options="gcache.size=2G"
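A rough sizing rule: GCache should hold at least the write volume generated during the longest node outage you want IST to survive. The 1 MB/s write rate and 30-minute window below are illustrative assumptions; estimate your own rate from deltas of wsrep_replicated_bytes:

```python
# Rough GCache sizing: write rate (bytes/s) x outage window (s), plus headroom.
# If GCache covers the outage, the node rejoins via IST instead of a full SST.

def gcache_size_bytes(write_rate_bytes_per_s, outage_seconds, headroom=1.5):
    return int(write_rate_bytes_per_s * outage_seconds * headroom)

# Assumed figures: 1 MB/s of replicated writes, 30-minute outage window.
size = gcache_size_bytes(1 * 1024**2, 30 * 60)
print(f"gcache.size={size // 1024**3}G")  # -> gcache.size=2G
```

With these assumptions the arithmetic lands on the 2G value used in the configuration line above; a busier cluster or a longer maintenance window needs proportionally more.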

Network-Related Issues

Galera replication depends on fast, reliable network communication. Any disruption can cause serious cluster instability.

Ports Reference:

  • 4567: Galera replication traffic; handles group communication and certification.
  • 4568: Incremental State Transfer (IST), used for catching up nodes via GCache.
  • 4444: State Snapshot Transfer (SST), required for full node synchronization and new joins.

Split Brain

Log:

WSREP: Quorum not reached for primary component
WSREP: Cluster size (1) < quorum (2)

Fix:

  • Always deploy an odd number of nodes (3, 5, …).
  • When a network partition occurs, only the majority partition remains writable by design.
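The quorum check in that log is majority arithmetic. Galera actually computes a weighted quorum, but with default weights it reduces to a simple majority, sketched here:

```python
# Majority quorum: a partition stays Primary only if it holds more than
# half of the cluster's nodes. An even split has no majority, which is
# why odd-sized clusters (3, 5, ...) are recommended.

def has_quorum(partition_size, cluster_size):
    return partition_size > cluster_size / 2

print(has_quorum(2, 3))  # majority of a 3-node cluster -> True
print(has_quorum(1, 3))  # isolated node -> False (goes non-Primary)
print(has_quorum(2, 4))  # even split of a 4-node cluster -> False on BOTH sides
```

The last case is the trap: a 4-node cluster split 2/2 leaves neither side writable, while a 3-node cluster split 2/1 keeps the majority side serving traffic.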

Packet Loss / Communication Failure

Log:

WSREP: communication failure detected, closing group

Fix: Check for firewall rules or NAT configurations that may be rewriting or dropping replication packets.

Transaction Conflicts and Deadlocks

Write conflicts are expected in a write-anywhere cluster. The key is handling them properly.

Certification failure example: Node 1 and Node 2 both run an UPDATE against the same row (row_ID 99) at the same time. The write-set that certifies second loses the optimistic-locking race, and its transaction is aborted with:

[ERROR] WSREP: Transaction failed due to certification error (Deadlock found)

Conflict Log:

WSREP: transaction cannot be certified
WSREP: Transaction failed due to write-set conflict

Fix:

  • Direct writes to a single primary where possible.
  • Keep transactions short.
  • Avoid updating the same rows from multiple nodes simultaneously.
  • Implement retry logic at the application layer.

Deadlock Log:

InnoDB: Deadlock found when trying to get lock
WSREP: Transaction rolled back due to certification failure

Recommendation: Use retry loops or optimistic concurrency patterns in your application code.
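Such a retry loop can be sketched as follows (a simplified example: DeadlockError stands in for whatever exception your driver raises for MySQL error 1213 / certification conflicts):

```python
# Sketch of application-level retry for deadlocks / certification conflicts.
# DeadlockError is a stand-in for the driver's exception (e.g. error 1213).
import random
import time

class DeadlockError(Exception):
    pass

def run_with_retry(txn, max_attempts=5, base_delay=0.05):
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()  # the whole transaction must be re-run, not one statement
        except DeadlockError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter before retrying.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Demo: a transaction that conflicts twice, then succeeds.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise DeadlockError("certification conflict")
    return "committed"

print(run_with_retry(flaky_txn))  # committed
```

The key design point is retrying the entire transaction, since Galera rolls back the whole write-set, and backing off with jitter so concurrent retries do not immediately collide again.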

Performance and Load Balancing Issues

If even one node is slow, the entire cluster feels it:

WSREP: Flow control active for 2000ms

Monitor with:

SHOW STATUS LIKE 'wsrep_flow_control%';
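wsrep_flow_control_paused reports the fraction of time (0.0 to 1.0) the cluster spent paused since the counters were last reset, so it maps directly onto an alert threshold. A sketch (the 10% threshold is an assumption; tune it to your workload):

```python
# Sketch: turn wsrep_flow_control_paused into a monitoring verdict.
# A value of 0.25 means the cluster was paused 25% of the time.

def flow_control_alert(paused_fraction, threshold=0.10):
    if paused_fraction > threshold:
        return f"WARN: cluster paused {paused_fraction:.0%} of the time"
    return "OK"

print(flow_control_alert(0.25))  # WARN: cluster paused 25% of the time
print(flow_control_alert(0.01))  # OK
```

Sustained warnings here mean one node's apply rate is throttling every writer in the cluster, so trace it back to the node with the largest wsrep_local_recv_queue.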

Load Balancing Tips:

  • Use MaxScale or ProxySQL to distribute traffic.
  • Route reads across all nodes, but keep write routing consistent.
  • Avoid overloading donor nodes while SST is in progress.

Tuning Recommendations (note: the first two settings relax per-node durability in exchange for throughput, a common trade-off in Galera since the cluster itself provides redundancy):

innodb_flush_log_at_trx_commit=2
sync_binlog=0
wsrep_slave_threads=8

Node Failures and Recovery

When a node fails, Galera handles it gracefully — provided quorum is maintained.

Node Failure Log:

WSREP: Member node3 left the cluster (TCP connection lost)
WSREP: Primary component reorganized: 2 nodes left

The cluster continues operating as long as the majority of nodes are alive.

The recovery method depends on how far behind the node is. When you restart a node:

  1. If within GCache window → uses IST (fast)
  2. If too old → performs SST (slow, full copy)
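That decision can be sketched as a comparison of sequence numbers (a simplified model: the donor can serve IST only while the joiner's last applied seqno is still inside the donor's GCache window):

```python
# Sketch: donor chooses IST when the joiner's seqno is still covered by
# GCache, otherwise it falls back to a full SST.

def transfer_method(joiner_seqno, gcache_oldest_seqno, donor_seqno):
    if gcache_oldest_seqno <= joiner_seqno <= donor_seqno:
        return "IST"   # stream only the missing write-sets
    return "SST"       # joiner too far behind: full snapshot required

print(transfer_method(joiner_seqno=950, gcache_oldest_seqno=900, donor_seqno=1000))  # IST
print(transfer_method(joiner_seqno=100, gcache_oldest_seqno=900, donor_seqno=1000))  # SST
```

This is why the GCache sizing discussed earlier matters: a larger GCache widens the window in which a restarted node can take the fast IST path.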
GCache works as a ring buffer of recent write-sets: new transactions are written at the head while the oldest are overwritten. While a node is desynced and rejoining, the donor streams the missing transactions to it from GCache (IST mode). If the data the joining node needs has already fallen off the ring, the cluster falls back to a full SST.

IST Log:

WSREP: Starting IST from node1
WSREP: IST received 2500 transactions

SST Log:

WSREP: Donor: node1, SST method: rsync
WSREP: SST complete, node joined cluster

Security and Configuration Errors

Common Mistakes:

  • Using non-InnoDB storage engines (not supported by Galera)
  • Missing primary keys on tables (required for row-based replication)
  • Mismatched my.cnf values across nodes

Secure Configuration Example:

[mysqld]
wsrep_provider_options="cert.log_conflicts=ON"
wsrep_sst_auth="sst_user:sst_pass"
bind-address=10.0.0.1  # bind to the private cluster interface rather than 0.0.0.0

SST Authentication Failure Log:

WSREP: Access denied for user 'sst_user'@'10.0.0.2'
WSREP: SST failed, donor refused connection

Fix: Check that the SST user exists on all nodes with the correct password and privileges.

Wrapping Up

MariaDB Galera Cluster delivers real high availability, consistency, and fault tolerance, but only when it's configured and maintained correctly. Understanding replication mechanics, reading wsrep logs accurately, and keeping performance balanced across nodes are what separate a stable cluster from a fragile one.

With the right setup, Galera keeps your data consistent, your nodes healthy, and your applications running without interruption.

Need Help With Your MariaDB Setup?

From setup and Galera troubleshooting to 24/7 Remote DBA support and comprehensive security audits, our experts manage, optimize, and protect your database environment end-to-end.

