High availability isn’t optional anymore. For telecom platforms, financial systems, and enterprise workloads, minutes of downtime can impact SLAs, revenue streams, and customer confidence.
MariaDB Galera Cluster addresses this with synchronous, multi-master replication: every node stays in sync, handles read/write traffic, and keeps your data consistent even when something goes wrong.
This guide covers how Galera works under the hood, the most common issues you'll encounter in real environments, and the exact logs and errors you'll see when troubleshooting them.
MariaDB Galera Cluster
MariaDB Galera Cluster is a synchronous multi-master replication system. Unlike traditional Source-Replica setups, every node can handle both reads and writes simultaneously.
Read: Setting Up MariaDB Galera Cluster from Scratch
Why use Galera?
Instead of asynchronous binlog-based replication, Galera uses Write Set Replication (wsrep). When a write happens on one node, its write set is broadcast to all nodes, and the transaction commits only after every node certifies that it can be applied without conflict, ensuring consistency across the cluster.
Example flow:
Write on Node 1 → Certified by Node 2 & Node 3 → Committed on all nodes
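The certification step can be illustrated with a toy model (this is a simplification, not the real wsrep algorithm): each node tracks the global sequence number of the last committed write per row key, and an incoming write set passes certification only if none of its keys were modified after the transaction's snapshot. First committer wins; the loser is aborted.

```python
# Toy model of Galera-style certification: first committer wins.
# Keys, seqnos, and the index structure are illustrative only.

class CertificationIndex:
    def __init__(self):
        self.last_modified = {}   # row key -> seqno of last committed write
        self.next_seqno = 1

    def certify(self, keys, snapshot_seqno):
        """Certify a write set touching `keys`, started at `snapshot_seqno`.

        Fails if any key was modified by a transaction that committed
        after the snapshot was taken (a concurrent conflicting write).
        """
        for key in keys:
            if self.last_modified.get(key, 0) > snapshot_seqno:
                return None  # certification failure -> transaction aborted
        seqno = self.next_seqno
        self.next_seqno += 1
        for key in keys:
            self.last_modified[key] = seqno
        return seqno  # committed at this global seqno

idx = CertificationIndex()
snap = idx.next_seqno - 1           # both transactions share one snapshot
t1 = idx.certify({"row:42"}, snap)  # first writer certifies fine
t2 = idx.certify({"row:42"}, snap)  # concurrent writer on same row aborts
```

This is exactly the behavior behind the "certification error" logs covered later: the conflict is detected at certification time, not by row locks.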
Status check:
SHOW STATUS LIKE 'wsrep_%';
For the full list of status variables, see the official reference.
Key variables to watch:
- wsrep_local_state_comment — e.g., Synced, Donor/Desynced, Joining
- wsrep_cluster_status
- wsrep_cluster_size
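A small helper can turn these three variables into a pass/fail health check, for example in a monitoring script. The `status` dict below stands in for the rows returned by `SHOW STATUS LIKE 'wsrep_%'`; all values shown are illustrative.

```python
# Minimal health evaluation of the three key wsrep status variables.
# In practice `status` would be built from a live SHOW STATUS query.

def galera_health(status):
    """Return a list of problems; an empty list means healthy."""
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not part of the primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node state is %s"
                        % status.get("wsrep_local_state_comment"))
    if int(status.get("wsrep_cluster_size", 0)) < 3:
        problems.append("fewer than 3 nodes in the cluster")
    return problems

healthy = galera_health({
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_cluster_size": "3",
})
degraded = galera_health({
    "wsrep_cluster_status": "non-Primary",
    "wsrep_local_state_comment": "Joining",
    "wsrep_cluster_size": "1",
})
```

The threshold of 3 nodes matches the minimum recommended cluster size discussed below; adjust it to your deployment.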
Galera Cluster Architecture
A typical MariaDB Galera Cluster consists of:
- Multiple MariaDB nodes (mysqld) — each running with Galera support enabled
- The wsrep API — the replication interface between MariaDB and Galera
- Galera Provider (libgalera_smm.so) — handles node communication and write-set certification
Core configuration (/etc/my.cnf.d/server.cnf):
[mysqld]
binlog_format=ROW
default_storage_engine=InnoDB
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name="mariadb_cluster"
wsrep_cluster_address="gcomm://node1,node2,node3"
wsrep_node_address="10.0.0.1"
wsrep_node_name="node1"
wsrep_sst_method=rsync
wsrep_sst_auth="sst_user:sst_pass"
Cluster health check:
SHOW STATUS LIKE 'wsrep_cluster%';
Expected healthy output:
wsrep_cluster_status = Primary
wsrep_cluster_size = 3
wsrep_local_state_comment = Synced
Log example - Node joining successfully:
Nov 19 10:43:12 node2 mysqld[2865]: WSREP: Member 1.2 (node2) synced with group.
Nov 19 10:43:12 node2 mysqld[2865]: WSREP: Synchronized with group, ready for connections.
Common Startup and Node Join Issues
Starting a new node or bringing up the cluster for the first time can fail for multiple reasons.
Case 1: Node Fails to Join the Cluster
Symptom: Node stays stuck in JOINING or NON-PRIMARY state.
Log:
WSREP: Failed to open TCP connection to peer node1:4567
WSREP: Member 0.0 (node2) requested state transfer from '*any*'; failed
Fix:
- Allow ports 4567 (replication), 4568 (IST), and 4444 (SST) through the firewall.
- Verify wsrep_cluster_address and node IPs.
- Confirm SST user credentials are consistent across all nodes.
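Before digging into SST credentials, it is worth confirming the ports are actually reachable from the joiner. A quick sketch of such a check (hostnames here are placeholders; run it from the joining node against the donor):

```python
# Check that Galera's standard ports are reachable on a peer node.
# Port numbers are the Galera defaults; adjust if yours differ.
import socket

GALERA_PORTS = {4567: "replication", 4568: "IST", 4444: "SST"}

def unreachable_ports(host, ports, timeout=2.0):
    """Return the subset of `ports` that cannot be connected to on `host`."""
    failed = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) != 0:
                failed.append(port)
    return failed

# Example (hypothetical host): unreachable_ports("node1", GALERA_PORTS)
# would return e.g. [4444] if only the SST port is firewalled.
```

Note that 4444 is only open on the donor while an SST is actually being served (it is opened by the SST script), so test it during a transfer attempt or check the firewall rules directly.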
Case 2: SST (State Snapshot Transfer) Failure
Log:
WSREP: SST failed: exit code 32
WSREP: State transfer required but no donor available
Fix:
- Ensure the donor node has enough disk space.
- Verify SST user privileges:
GRANT RELOAD, LOCK TABLES, PROCESS, SUPER ON *.* TO 'sst_user'@'%' IDENTIFIED BY 'sst_pass';
Case 3: Bootstrap Errors
Log:
WSREP: Failed to determine cluster address from configuration
Bootstrap the first node with:
galera_new_cluster
Or set an empty cluster address on the first node:
wsrep_cluster_address="gcomm://"
Then start the remaining nodes normally:
systemctl start mariadb
Replication and Synchronization Problems
Even with synchronous replication, nodes can fall behind or desync temporarily.
Flow Control Active
Log:
WSREP: Flow control paused, waiting for a slow node.
WSREP: Paused writes for 1.2s due to high send queue.
What it means: One node is too slow; the others are waiting for it to catch up.
Fix:
- Check CPU and disk I/O on the lagging node.
- Investigate bottlenecks before they cascade.
Certification Failures
Log:
WSREP: Transaction failed due to certification error
WSREP: Aborting transaction (conflict detected)
What it means: Two nodes attempted to write to the same row at the same time.
Fix: Add retry logic in the application layer, reduce concurrent writes to the same rows, and ensure unique key constraints are in place.
Frequent Full SSTs Instead of IST
Log:
WSREP: IST not possible, falling back to SST
Fix: Increase GCache size:
wsrep_provider_options="gcache.size=2G"
Network-Related Issues
Galera replication depends on fast, reliable network communication. Any disruption can cause serious cluster instability.
Ports Reference:
- 3306: MySQL client connections
- 4567: Galera replication traffic
- 4568: Incremental State Transfer (IST)
- 4444: State Snapshot Transfer (SST)
Split Brain
Log:
WSREP: Quorum not reached for primary component
WSREP: Cluster size (1) < quorum (2)
Fix:
- Always deploy an odd number of nodes (3, 5, …).
- When a network partition occurs, only the majority partition remains writable by design.
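The odd-node recommendation follows directly from the quorum arithmetic: a partition stays Primary only if it holds a strict majority, so an even-sized cluster pays for the extra node without gaining any failure tolerance. A quick sketch of the math:

```python
# Quorum is a strict majority of the cluster's nodes.

def quorum(cluster_size):
    return cluster_size // 2 + 1

# A 3-node cluster tolerates 1 lost node, a 5-node cluster tolerates 2,
# but a 4-node cluster still only tolerates 1 (its quorum is 3).
tolerated_failures = {n: n - quorum(n) for n in (3, 4, 5)}
```

This is also why a 2-node cluster is fragile: losing either node drops the survivor below quorum (2), making it non-Primary.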
Packet Loss / Communication Failure
Log:
WSREP: communication failure detected, closing group
Fix: Check for firewall rules or NAT configurations that may be rewriting or dropping replication packets.
Transaction Conflicts and Deadlocks
Write conflicts are expected in a write-anywhere cluster. The key is handling them properly.
Conflict Log:
WSREP: transaction cannot be certified
WSREP: Transaction failed due to write-set conflict
Fix:
- Direct writes to a single primary where possible.
- Keep transactions short.
- Avoid updating the same rows from multiple nodes simultaneously.
- Implement retry logic at the application layer.
Deadlock Log:
InnoDB: Deadlock found when trying to get lock
WSREP: Transaction rolled back due to certification failure
Recommendation: Use retry loops or optimistic concurrency patterns in your application code.
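A retry loop for these transient failures can be sketched as follows. The `TransientDBError` class and `flaky_txn` function are stand-ins; real code would catch the driver's exception and inspect its error code (e.g. 1213 for a deadlock) before retrying.

```python
# Sketch of application-side retry for deadlocks and wsrep
# certification failures. Error types here are placeholders.
import random
import time

class TransientDBError(Exception):
    """Stand-in for a deadlock or certification-failure error."""

def with_retries(run_txn, attempts=3, base_delay=0.05):
    """Run `run_txn`, retrying with backoff on transient errors."""
    for attempt in range(attempts):
        try:
            return run_txn()
        except TransientDBError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter so retries don't re-collide.
            time.sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))

# Hypothetical transaction that conflicts twice, then succeeds:
calls = {"n": 0}
def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientDBError("write-set conflict")
    return "committed"

result = with_retries(flaky_txn)
```

The jitter matters: if two conflicting writers retry after identical delays, they are likely to collide again.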
Performance and Load Balancing Issues
If even one node is slow, the entire cluster feels it:
WSREP: Flow control active for 2000ms
Monitor with:
SHOW STATUS LIKE 'wsrep_flow_control%';
Load Balancing Tips:
- Use MaxScale or ProxySQL to distribute traffic.
- Route reads across all nodes, but keep write routing consistent.
- Avoid overloading donor nodes while SST is in progress.
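The routing advice above can be sketched as a toy router: writes always land on one designated node (avoiding cross-node write conflicts), while reads round-robin across the cluster. Node names are illustrative; in production this logic belongs in MaxScale or ProxySQL, not application code.

```python
# Toy read/write router: single write node, round-robin reads.
import itertools

class GaleraRouter:
    def __init__(self, nodes, write_node):
        self.write_node = write_node
        self.read_cycle = itertools.cycle(nodes)

    def route(self, is_write):
        """Pick a node: the fixed write node, or the next read node."""
        return self.write_node if is_write else next(self.read_cycle)

router = GaleraRouter(["node1", "node2", "node3"], write_node="node1")
writes = [router.route(True) for _ in range(3)]
reads = [router.route(False) for _ in range(3)]
```

A real router would also consult node health (the wsrep variables above) and skip Donor/Desynced nodes for reads.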
Tuning Recommendations (note: the first two settings trade some per-node durability for throughput, relying on the other nodes for redundancy):
innodb_flush_log_at_trx_commit=2
sync_binlog=0
wsrep_slave_threads=8
Node Failures and Recovery
When a node fails, Galera handles it gracefully — provided quorum is maintained.
Node Failure Log:
WSREP: Member node3 left the cluster (TCP connection lost)
WSREP: Primary component reorganized: 2 nodes left
The cluster continues operating as long as the majority of nodes are alive.
Recovery method when you restart a node:
- If within GCache window → uses IST (fast)
- If too old → performs SST (slow, full copy)
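The IST-vs-SST decision comes down to whether the donor's GCache still holds every write set the joiner missed. A sketch of that check (seqnos are illustrative):

```python
# Sketch of the IST-vs-SST decision a donor makes for a rejoining node.

def transfer_method(joiner_seqno, gcache_oldest_seqno):
    """Return 'IST' if the joiner's gap fits in the donor's GCache.

    The joiner has applied everything up to `joiner_seqno`, so it needs
    write sets from `joiner_seqno + 1` onward. If the donor's GCache
    still starts at or before that point, an incremental transfer works;
    otherwise the joiner needs a full state snapshot.
    """
    if joiner_seqno >= gcache_oldest_seqno - 1:
        return "IST"   # send only the missing write sets (fast)
    return "SST"       # copy the entire dataset (slow)

fast = transfer_method(joiner_seqno=9500, gcache_oldest_seqno=9000)
slow = transfer_method(joiner_seqno=4000, gcache_oldest_seqno=9000)
```

This is why the gcache.size fix shown earlier reduces full SSTs: a larger GCache pushes `gcache_oldest_seqno` further back, widening the window in which IST is possible.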
IST Log:
WSREP: Starting IST from node1
WSREP: IST received 2500 transactions
SST Log:
WSREP: Donor: node1, SST method: rsync
WSREP: SST complete, node joined cluster
Security and Configuration Errors
Common Mistakes:
- Using non-InnoDB storage engines (not supported by Galera)
- Missing primary keys on tables (required for row-based replication)
- Mismatched my.cnf values across nodes
Secure Configuration Example:
[mysqld]
wsrep_provider_options="cert.log_conflicts=ON"
wsrep_sst_auth="sst_user:sst_pass"
bind-address=0.0.0.0
SST Authentication Failure Log:
WSREP: Access denied for user 'sst_user'@'10.0.0.2'
WSREP: SST failed, donor refused connection
Fix: Check that the SST user exists on all nodes with the correct password and privileges.
Wrapping Up
MariaDB Galera Cluster delivers real high availability, consistency, and fault tolerance, but only when it's configured and maintained correctly. Understanding replication mechanics, reading wsrep logs accurately, and keeping performance balanced across nodes are what separate a stable cluster from a fragile one.
With the right setup, Galera keeps your data consistent, your nodes healthy, and your applications running without interruption.
Need Help With Your MariaDB Setup?
From setup and Galera troubleshooting to 24/7 Remote DBA support and comprehensive security audits, our experts manage, optimize, and protect your database environment end-to-end.