Troubleshooting Common Issues in Galera DB Clusters

Mydbops
Mar 6, 2026

High availability isn’t optional anymore. For telecom platforms, financial systems, and enterprise workloads, minutes of downtime can impact SLAs, revenue streams, and customer confidence.

MariaDB Galera Cluster addresses this with synchronous, multi-master replication: every node stays in sync, handles read/write traffic, and keeps your data consistent even when something goes wrong.

This guide covers how Galera works under the hood, the most common issues you'll encounter in real environments, and the exact logs and errors you'll see when troubleshooting them.

MariaDB Galera Cluster

MariaDB Galera Cluster is a synchronous multi-master replication system. Unlike traditional Source-Replica setups, every node can handle both reads and writes simultaneously.

Read: Setting Up MariaDB Galera Cluster from Scratch

Why use Galera?

Galera cluster capabilities:

  • No single point of failure: all nodes are active and can serve traffic simultaneously, so the architecture remains robust even if individual components fail.
  • Automatic failover: if one node dies, the cluster detects it and routes traffic to the remaining nodes without downtime.
  • Synchronous consistency: all nodes hold the same data at the same time via Write Set Replication, guaranteeing data integrity.
  • Simple scaling: add a new node and it joins the cluster, syncs automatically, and starts serving traffic.

Instead of asynchronous binlog-based replication, Galera uses Write Set Replication (wsrep). When a write happens on one node, the transaction is broadcast to all nodes and only committed once every node confirms it can apply it, ensuring consistency across the cluster.

Example flow:

Write on Node 1 → Certified by Node 2 & Node 3 → Committed on all nodes

(Diagram: a write on the primary produces a write-set that is broadcast to the receiver nodes, passes the certification test, is queued into the received queue, and is applied by the applier and committer threads before commit.)
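This certify-then-commit flow can be illustrated with a tiny simulation (a simplified model for intuition only, not Galera's actual algorithm): each write-set records the keys it touches, and certification fails when one of those keys was already modified by a write-set certified after the transaction's snapshot.

```python
# Simplified sketch of optimistic certification. A write-set is rejected
# when a key it touches was modified after the transaction's snapshot.
# This is an illustration, not the real Galera implementation.

class CertificationIndex:
    def __init__(self):
        self.last_seqno = 0
        self.certified = {}  # key -> seqno of the write-set that last modified it

    def certify(self, keys, start_seqno):
        """Certify a write-set whose transaction started at snapshot start_seqno."""
        for key in keys:
            if self.certified.get(key, 0) > start_seqno:
                return False  # certification failure -> transaction rolled back
        self.last_seqno += 1
        for key in keys:
            self.certified[key] = self.last_seqno
        return True

idx = CertificationIndex()
snapshot = idx.last_seqno                  # two transactions start concurrently
print(idx.certify({"row:99"}, snapshot))   # first writer certifies -> True
print(idx.certify({"row:99"}, snapshot))   # concurrent writer conflicts -> False
```

The first writer wins; the second is aborted and must be retried by the application, which is exactly the behavior behind the certification errors covered later.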

Status check:

SHOW STATUS LIKE 'wsrep_%';

For the full list of status variables, see the official reference.

Key variables to watch:

  • wsrep_cluster_status: should be Primary
  • wsrep_cluster_size: number of nodes currently in the cluster
  • wsrep_local_state_comment: should be Synced
  • wsrep_local_recv_queue: a persistently large value indicates a lagging node
  • wsrep_flow_control_paused: fraction of time replication was paused by flow control

Galera Cluster Architecture

A typical MariaDB Galera Cluster consists of:

  • Multiple MariaDB nodes (mysqld) — each running with Galera support enabled
  • The wsrep API — the replication interface between MariaDB and Galera
  • Galera Provider (libgalera_smm.so) — handles node communication and write-set certification

Core configuration (/etc/my.cnf.d/server.cnf):

[mysqld]
binlog_format=ROW
default_storage_engine=InnoDB
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name="mariadb_cluster"
wsrep_cluster_address="gcomm://node1,node2,node3"
wsrep_node_address="10.0.0.1"
wsrep_node_name="node1"
wsrep_sst_method=rsync
wsrep_sst_auth="sst_user:sst_pass"

Cluster health check:

SHOW STATUS LIKE 'wsrep_cluster%';

Expected healthy output:

wsrep_cluster_status = Primary
wsrep_cluster_size   = 3
wsrep_local_state_comment = Synced
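That check is easy to script. The sketch below parses the tab-separated "name value" lines printed by the mysql client (the three-node expected size is an assumption; adjust for your cluster):

```python
# Sketch: decide whether a node looks healthy from the output of
# `mysql -Ne "SHOW STATUS LIKE 'wsrep_%'"` (tab-separated name/value lines).

EXPECTED_CLUSTER_SIZE = 3  # assumption: a three-node cluster

def parse_wsrep_status(text):
    status = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition("\t")
        status[name] = value
    return status

def is_healthy(status, expected_size=EXPECTED_CLUSTER_SIZE):
    return (status.get("wsrep_cluster_status") == "Primary"
            and status.get("wsrep_local_state_comment") == "Synced"
            and int(status.get("wsrep_cluster_size", 0)) == expected_size)

sample = """wsrep_cluster_status\tPrimary
wsrep_cluster_size\t3
wsrep_local_state_comment\tSynced"""
print(is_healthy(parse_wsrep_status(sample)))  # True
```

Wiring this into your monitoring gives an immediate alert when a node drops out of the Primary component or falls behind.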

Log example - Node joining successfully:

Nov 19 10:43:12 node2 mysqld[2865]: WSREP: Member 1.2 (node2) synced with group.
Nov 19 10:43:12 node2 mysqld[2865]: WSREP: Synchronized with group, ready for connections.

Common Startup and Node Join Issues

Starting a new node or bringing up the cluster for the first time can fail for multiple reasons.

Case 1: Node Fails to Join the Cluster

Symptom: Node stays stuck in JOINING or NON-PRIMARY state.

Log:

WSREP: Failed to open TCP connection to peer node1:4567
WSREP: Member 0.0 (node2) requested state transfer from '*any*'; failed

Fix:

  • Allow ports 4567 (replication), 4568 (IST), and 4444 (SST) through the firewall.
  • Verify wsrep_cluster_address and node IPs.
  • Confirm SST user credentials are consistent across all nodes.
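A quick way to confirm those ports are actually reachable from a joining node is a plain TCP probe, sketched here in Python (the 127.0.0.1 address is a placeholder; point it at your cluster nodes):

```python
# Sketch: TCP reachability check for the Galera ports
# (4567 replication, 4568 IST, 4444 SST).
import socket

GALERA_PORTS = {4567: "replication", 4568: "IST", 4444: "SST"}

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

host = "127.0.0.1"  # placeholder: replace with a cluster node's address
for port, role in GALERA_PORTS.items():
    state = "open" if port_open(host, port) else "blocked"
    print(f"{host}:{port} ({role}): {state}")
```

Run it from the joiner toward each donor candidate; anything reported blocked points at the firewall or NAT rules mentioned above.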

Case 2: SST (State Snapshot Transfer) Failure

Log:

WSREP: SST failed: exit code 32
WSREP: State transfer required but no donor available

Fix:

  • Ensure the donor node has enough disk space.
  • Verify SST user privileges:
GRANT RELOAD, LOCK TABLES, PROCESS, SUPER ON *.* TO 'sst_user'@'%' IDENTIFIED BY 'sst_pass';

Case 3: Bootstrap Errors

Log:

WSREP: Failed to determine cluster address from configuration

Bootstrap the first node with:

galera_new_cluster

Or set an empty cluster address on the first node and start it:

wsrep_cluster_address="gcomm://"

Then start the remaining nodes normally:

systemctl start mariadb

Replication and Synchronization Problems

Even with synchronous replication, nodes can fall behind or desync temporarily.

Flow Control Active

Log:

WSREP: Flow control paused, waiting for a slow node.
WSREP: Paused writes for 1.2s due to high send queue.

What it means: One node is too slow; the others are waiting for it to catch up.

Fix:

  • Check CPU and disk I/O on the lagging node.
  • Investigate bottlenecks before they cascade.

Certification Failures

Log:

WSREP: Transaction failed due to certification error
WSREP: Aborting transaction (conflict detected)

What it means: Two nodes attempted to write to the same row at the same time.

Fix: Add retry logic in the application layer, reduce concurrent writes to the same rows, and ensure unique key constraints are in place.

Frequent Full SSTs Instead of IST

Log:

WSREP: IST not possible, falling back to SST

Fix: Increase GCache size:

wsrep_provider_options="gcache.size=2G"
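A rough sizing rule: GCache should hold at least the write volume generated during the longest node outage you want IST to survive. The 1 MB/s write rate and 30-minute window below are illustrative assumptions; estimate your own rate from deltas of wsrep_replicated_bytes:

```python
# Rough GCache sizing: write rate (bytes/s) x outage window (s), plus headroom.
# If GCache covers the outage, the node rejoins via IST instead of a full SST.

def gcache_size_bytes(write_rate_bytes_per_s, outage_seconds, headroom=1.5):
    return int(write_rate_bytes_per_s * outage_seconds * headroom)

# Assumed figures: 1 MB/s of replicated writes, 30-minute outage window.
size = gcache_size_bytes(1 * 1024**2, 30 * 60)
print(f"gcache.size={size // 1024**3}G")  # -> gcache.size=2G
```

With these assumptions the arithmetic lands on the 2G value used in the configuration line above; a busier cluster or a longer maintenance window needs proportionally more.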

Network-Related Issues

Galera replication depends on fast, reliable network communication. Any disruption can cause serious cluster instability.

Ports Reference:

  • 4567: Galera replication traffic; handles group communication and certification.
  • 4568: Incremental State Transfer (IST), used for catching up nodes via GCache.
  • 4444: State Snapshot Transfer (SST), required for full node synchronization and new joins.

Split Brain

Log:

WSREP: Quorum not reached for primary component
WSREP: Cluster size (1) < quorum (2)

Fix:

  • Always deploy an odd number of nodes (3, 5, …).
  • When a network partition occurs, only the majority partition remains writable by design.
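The quorum check in that log is majority arithmetic. Galera actually computes a weighted quorum, but with default weights it reduces to a simple majority, sketched here:

```python
# Majority quorum: a partition stays Primary only if it holds more than
# half of the cluster's nodes. An even split has no majority, which is
# why odd-sized clusters (3, 5, ...) are recommended.

def has_quorum(partition_size, cluster_size):
    return partition_size > cluster_size / 2

print(has_quorum(2, 3))  # majority of a 3-node cluster -> True
print(has_quorum(1, 3))  # isolated node -> False (goes non-Primary)
print(has_quorum(2, 4))  # even split of a 4-node cluster -> False on BOTH sides
```

The last case is the trap: a 4-node cluster split 2/2 leaves neither side writable, while a 3-node cluster split 2/1 keeps the majority side serving traffic.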

Packet Loss / Communication Failure

Log:

WSREP: communication failure detected, closing group

Fix: Check for firewall rules or NAT configurations that may be rewriting or dropping replication packets.

Transaction Conflicts and Deadlocks

Write conflicts are expected in a write-anywhere cluster. The key is handling them properly.

Certification failure example: Node 1 and Node 2 both run an UPDATE against the same row (row_ID 99) at the same time. The write-set that certifies second loses the optimistic-locking race, and its transaction is aborted with:

[ERROR] WSREP: Transaction failed due to certification error (Deadlock found)

Conflict Log:

WSREP: transaction cannot be certified
WSREP: Transaction failed due to write-set conflict

Fix:

  • Direct writes to a single primary where possible.
  • Keep transactions short.
  • Avoid updating the same rows from multiple nodes simultaneously.
  • Implement retry logic at the application layer.

Deadlock Log:

InnoDB: Deadlock found when trying to get lock
WSREP: Transaction rolled back due to certification failure

Recommendation: Use retry loops or optimistic concurrency patterns in your application code.
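Such a retry loop can be sketched as follows (a simplified example: DeadlockError stands in for whatever exception your driver raises for MySQL error 1213 / certification conflicts):

```python
# Sketch of application-level retry for deadlocks / certification conflicts.
# DeadlockError is a stand-in for the driver's exception (e.g. error 1213).
import random
import time

class DeadlockError(Exception):
    pass

def run_with_retry(txn, max_attempts=5, base_delay=0.05):
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()  # the whole transaction must be re-run, not one statement
        except DeadlockError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter before retrying.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Demo: a transaction that conflicts twice, then succeeds.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise DeadlockError("certification conflict")
    return "committed"

print(run_with_retry(flaky_txn))  # committed
```

The key design point is retrying the entire transaction, since Galera rolls back the whole write-set, and backing off with jitter so concurrent retries do not immediately collide again.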

Performance and Load Balancing Issues

If even one node is slow, the entire cluster feels it:

WSREP: Flow control active for 2000ms

Monitor with:

SHOW STATUS LIKE 'wsrep_flow_control%';
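wsrep_flow_control_paused reports the fraction of time (0.0 to 1.0) the cluster spent paused since the counters were last reset, so it maps directly onto an alert threshold. A sketch (the 10% threshold is an assumption; tune it to your workload):

```python
# Sketch: turn wsrep_flow_control_paused into a monitoring verdict.
# A value of 0.25 means the cluster was paused 25% of the time.

def flow_control_alert(paused_fraction, threshold=0.10):
    if paused_fraction > threshold:
        return f"WARN: cluster paused {paused_fraction:.0%} of the time"
    return "OK"

print(flow_control_alert(0.25))  # WARN: cluster paused 25% of the time
print(flow_control_alert(0.01))  # OK
```

Sustained warnings here mean one node's apply rate is throttling every writer in the cluster, so trace it back to the node with the largest wsrep_local_recv_queue.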

Load Balancing Tips:

  • Use MaxScale or ProxySQL to distribute traffic.
  • Route reads across all nodes, but keep write routing consistent.
  • Avoid overloading donor nodes while SST is in progress.

Tuning Recommendations (note: the first two settings relax per-node durability in exchange for throughput, a common trade-off in Galera since the cluster itself provides redundancy):

innodb_flush_log_at_trx_commit=2
sync_binlog=0
wsrep_slave_threads=8

Node Failures and Recovery

When a node fails, Galera handles it gracefully — provided quorum is maintained.

Node Failure Log:

WSREP: Member node3 left the cluster (TCP connection lost)
WSREP: Primary component reorganized: 2 nodes left

The cluster continues operating as long as the majority of nodes are alive.

The recovery method depends on how far behind the node is. When you restart a node:

  1. If within GCache window → uses IST (fast)
  2. If too old → performs SST (slow, full copy)
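That decision can be sketched as a comparison of sequence numbers (a simplified model: the donor can serve IST only while the joiner's last applied seqno is still inside the donor's GCache window):

```python
# Sketch: donor chooses IST when the joiner's seqno is still covered by
# GCache, otherwise it falls back to a full SST.

def transfer_method(joiner_seqno, gcache_oldest_seqno, donor_seqno):
    if gcache_oldest_seqno <= joiner_seqno <= donor_seqno:
        return "IST"   # stream only the missing write-sets
    return "SST"       # joiner too far behind: full snapshot required

print(transfer_method(joiner_seqno=950, gcache_oldest_seqno=900, donor_seqno=1000))  # IST
print(transfer_method(joiner_seqno=100, gcache_oldest_seqno=900, donor_seqno=1000))  # SST
```

This is why the GCache sizing discussed earlier matters: a larger GCache widens the window in which a restarted node can take the fast IST path.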
GCache works as a ring buffer of recent write-sets: new transactions are written at the head while the oldest are overwritten. While a node is desynced and rejoining, the donor streams the missing transactions to it from GCache (IST mode). If the data the joining node needs has already fallen off the ring, the cluster falls back to a full SST.

IST Log:

WSREP: Starting IST from node1
WSREP: IST received 2500 transactions

SST Log:

WSREP: Donor: node1, SST method: rsync
WSREP: SST complete, node joined cluster

Security and Configuration Errors

Common Mistakes:

  • Using non-InnoDB storage engines (not supported by Galera)
  • Missing primary keys on tables (required for row-based replication)
  • Mismatched my.cnf values across nodes

Secure Configuration Example:

[mysqld]
wsrep_provider_options="cert.log_conflicts=ON"
wsrep_sst_auth="sst_user:sst_pass"
bind-address=10.0.0.1  # bind to the private cluster interface rather than 0.0.0.0

SST Authentication Failure Log:

WSREP: Access denied for user 'sst_user'@'10.0.0.2'
WSREP: SST failed, donor refused connection

Fix: Check that the SST user exists on all nodes with the correct password and privileges.

Wrapping Up

MariaDB Galera Cluster delivers real high availability, consistency, and fault tolerance, but only when it's configured and maintained correctly. Understanding replication mechanics, reading wsrep logs accurately, and keeping performance balanced across nodes are what separate a stable cluster from a fragile one.

With the right setup, Galera keeps your data consistent, your nodes healthy, and your applications running without interruption.

Need Help With Your MariaDB Setup?

From setup and Galera troubleshooting to 24/7 Remote DBA support and comprehensive security audits, our experts manage, optimize, and protect your database environment end-to-end.

