

Handling Long Duration SST Timeouts in Percona XtraDB and MariaDB Clusters with systemd
Encountering mysterious SST failures during Percona XtraDB Cluster (PXC) deployments? The culprit might be a systemd timeout limitation. This blog post by Mydbops, your MySQL specialists, dives deep into understanding and overcoming SST timeouts in systemd for both PXC and MariaDB Cluster.
This post explains the SST timeouts imposed by the systemd startup implementation, which we ran into recently on a Percona XtraDB Cluster while consulting for a client. A State Snapshot Transfer (SST) is a complete data sync from one node of the cluster to the joining node.
Why Default Systemd Timeouts Can Cause SST Failures
An SST happens for one or more of the reasons listed below.
- Initial sync when a node joins the cluster.
- A node has dropped out of the cluster and can no longer rejoin incrementally, either because of data corruption or inconsistencies, or because it has fallen so far behind that the starting point for recovery has already been purged or rotated out of the gcache (where the recovery logs are written).
It is important to understand the timeouts related to SST, because in a large cluster an SST can take hours to complete. If it fails on a timeout midway, it can ruin your day.
We will look at SST timeouts on two large-scale Galera Cluster implementations, Percona XtraDB Cluster and MariaDB Cluster, with the systemd startup process.
Overcoming SST Timeouts in PXC and MariaDB Cluster
Percona XtraDB Cluster (PXC)
PXC Version: 5.6.38
Systemd Service Script: /usr/lib/systemd/system/mysql.service
When a node goes for SST, the startup script waits on ExecStartPost to return OK.
- The post-start check calls /usr/bin/mysql-systemd with the argument start-post, which is handled by the script's case statement.
- Inside start-post, the wait_for_pid function is invoked with the argument created and the pid file path. The script then loops inside wait_for_pid until the SST completes.
- The part of wait_for_pid relevant to this discussion is its while loop.
This while loop runs up to service_startup_timeout times, sleeping for startup_sleep (10 seconds) on each pass. The value of service_startup_timeout is hardcoded in the script as 900.
- So, under systemd, an SST has only 900 * 10 = 9000 seconds = 2 hrs 30 min to complete, and it times out after that.
- For a very large cluster this is a bottleneck. A bigger data set can take longer than that to transfer, and failing midway is about the worst thing that can happen. The error thrown when this happens is also misleading and unclear.
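The arithmetic behind that limit can be sketched in shell. This is only an illustration, not code taken from /usr/bin/mysql-systemd: the function name wait_budget is hypothetical, and only the two inputs (900 attempts, 10 seconds between attempts) come from the description above.

```shell
#!/bin/sh
# Hypothetical helper (not part of mysql-systemd): total time a polling loop
# can wait, given the number of attempts and the sleep between attempts.
wait_budget() {
    attempts=$1
    sleep_secs=$2
    total=$(( attempts * sleep_secs ))
    printf '%d seconds (%dh %02dm)\n' "$total" $(( total / 3600 )) $(( total % 3600 / 60 ))
}

# Values described above: 900 attempts, 10 seconds apart.
wait_budget 900 10
```

Running it prints 9000 seconds (2h 30m), which matches the SST window observed in practice.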
Testing: In our test with PXC version 5.6.38 on CentOS 7 and a 1.5 TB data set, the SST timed out midway, after approx. 2 hrs 30 min, when almost 700G had been copied.
Error Logs:
Joiner:
Donor:
SST Duration: 19:13:04 – 21:42:44 ~ Timeout In 2 hrs 30 min
Solutions:
Method 1:
- Edit /usr/bin/mysql-systemd and raise service_startup_timeout from 900 to a much higher value. In our case, we set it to 2880, which gives 8 hours.
 (2880*10)/60/60 = 8 hrs
# sed -i 's/service_startup_timeout=900/service_startup_timeout=2880/g' /usr/bin/mysql-systemd
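To pick a value for a different SST window, the same calculation can be run in reverse. A small sketch, assuming the 10-second sleep described earlier stays unchanged; iterations_for_hours is a hypothetical helper name, not something the script provides.

```shell
#!/bin/sh
# Hypothetical helper: how many loop iterations are needed to wait for a
# given number of hours, assuming each iteration sleeps sleep_secs seconds.
iterations_for_hours() {
    hours=$1
    sleep_secs=${2:-10}
    echo $(( hours * 3600 / sleep_secs ))
}

iterations_for_hours 8    # prints 2880, the value used above
iterations_for_hours 12   # prints 4320
```

Since /usr/bin/mysql-systemd is a packaged file, a manual edit is likely to be overwritten by a package upgrade, so it is worth re-checking the value afterwards.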
Method 2:
- In /usr/bin/mysql-systemd, we can see that this variable is also read from the mysqld_safe section of the configuration:
service_startup_timeout=$(parse_cnf service-startup-timeout $service_startup_timeout mysqld_safe)
- This is mentioned in the mysql.service unit file, but not clearly.
- So we can also define the service_startup_timeout variable in /etc/my.cnf under the [mysqld_safe] section.
- The value set in /etc/my.cnf takes precedence over the one hardcoded in the script.
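As a sketch of Method 2, the snippet below appends the setting to a configuration file. The option is spelled with hyphens to match the name the script passes to parse_cnf. The CNF variable and its scratch-file default are assumptions made so the example is safe to run, and 2880 is simply the value from Method 1.

```shell
#!/bin/sh
# CNF is a hypothetical variable for this sketch. It defaults to a scratch
# file; set CNF=/etc/my.cnf (as root) to change the real configuration.
CNF="${CNF:-$(mktemp)}"

# Append a [mysqld_safe] section carrying the startup timeout.
cat >> "$CNF" <<'EOF'

[mysqld_safe]
service-startup-timeout=2880
EOF

# Show the setting that was written.
grep -A1 '^\[mysqld_safe\]' "$CNF"
```

If the file already has a [mysqld_safe] section, it is tidier to add the line there than to append a second section. On a real server, running my_print_defaults mysqld_safe should list the option if it is being picked up.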
This behaviour has been reported to the Percona team: https://jira.percona.com/browse/PXC-2080
MariaDB Cluster
MariaDB Version: 10.1.31
MariaDB provides clear information on how to increase the SST timeout in its upgrade documentation:
On Linux distributions that use systemd you may need to increase the service startup timeout as the default timeout of 20 minutes may not be sufficient if a SST becomes necessary.
- Create a file /etc/systemd/system/mariadb.service.d/timeout.conf with the following content.
 [Service]
 TimeoutSec=infinity
- If you are using a systemd version older than 229, replace infinity with 0.
- Execute # systemctl daemon-reload after the change for the new timeout setting to take effect.
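The steps above can be scripted roughly as follows. This is a sketch under assumptions: timeout_value_for and write_dropin are hypothetical helper names, the drop-in directory is passed as a parameter so the example can be tried against a scratch directory, and 219 stands in for a pre-229 systemd version (the real one appears on the first line of systemctl --version).

```shell
#!/bin/sh
# Hypothetical helper: pick the TimeoutSec value for a systemd version,
# following the rule above (infinity from 229 on, 0 before that).
timeout_value_for() {
    if [ "$1" -ge 229 ]; then
        echo infinity
    else
        echo 0
    fi
}

# Hypothetical helper: write timeout.conf into the given drop-in directory.
write_dropin() {
    dir=$1
    value=$2
    mkdir -p "$dir"
    printf '[Service]\nTimeoutSec=%s\n' "$value" > "$dir/timeout.conf"
}

# Demo against a scratch directory; on a real server (as root) the target
# would be /etc/systemd/system/mariadb.service.d
demo_dir=$(mktemp -d)
write_dropin "$demo_dir" "$(timeout_value_for 219)"
cat "$demo_dir/timeout.conf"
```

On a real host, after systemctl daemon-reload, the command systemctl show mariadb -p TimeoutStartUSec should report the start timeout that systemd will actually apply.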
It is also worth noting that MariaDB provides very good documentation on the systemd startup script and its variables. You can read it at the following link.
https://mariadb.com/kb/en/library/systemd/
The mariadb-service-convert script, which generates the systemd startup script variables from /etc/my.cnf, is fascinating. I am not going into detail on it here, as it is out of scope for this blog. I really admire how clear the documentation is.
Key Takeaways:
SST under the systemd implementation is subject to timeouts.
For Percona XtraDB Cluster, it is 2 hours 30 minutes.
For MariaDB Cluster, it is 20 minutes.
If your data copy during SST is going to take longer than that, use the solutions provided above to avoid surprises in production.
SST timeouts can disrupt your database scaling plans, regardless of whether you're using Percona XtraDB Cluster (PXC) or MariaDB Cluster. Mydbops offers comprehensive MySQL and MariaDB management services to help you navigate these complexities for both platforms.
Focus on your core business while Mydbops safeguards your database scaling journey, whether you're using MySQL or MariaDB. Contact us today for a free consultation!