Server1 (id=1) has a mirror, Server2 (id=11). About a month ago, when I cleaned out old databases and database users, two jobs got stuck, and they have been stuck ever since. Today I noticed six additional stuck jobs. I've updated both servers to the latest version.

The following changes are not yet populated to all servers:
Update main config: 6
Delete database user: 2

Normal changes seem to work: when I make a change in the ISPConfig GUI and check the debug output, the new job is processed and the balloon value drops back to 8.

Code:
root@server2 # /usr/local/ispconfig/server/server.sh
Set Lock: /usr/local/ispconfig/server/temp/.ispconfig_lock
18.11.2024-14:48 - DEBUG [plugins.inc:155] - Calling function 'check_phpini_changes' from plugin 'webserver_plugin' raised by action 'server_plugins_loaded'.
18.11.2024-14:48 - DEBUG [server:224] - Remove Lock: /usr/local/ispconfig/server/temp/.ispconfig_lock
finished server.php.

From sys_datalog I think the following are in the queue (no errors):

datalogid  server_id  dbtable  dbidx
94407      0          sys_ini  sysini_id:1
94409      0          sys_ini  sysini_id:1
94405      0          sys_ini  sysini_id:1
94403      0          sys_ini  sysini_id:1
94401      0          sys_ini  sysini_id:1
94399      0          sys_ini  sysini_id:1

And from the server table:

server_id  updated
1          94411
11         94351

From the forum I've read that sys_datalog entries with server_id = 0 are broadcasts, but since server 11 is a mirror, shouldn't it have picked them up automatically? I have MySQL database replication enabled, so I suspect the database could have been removed before the job was processed; but if that were the cause, I would expect it to happen more often. Now that there are six new jobs, I'm out of clues.
Is server 11 running and fetching jobs from the queue correctly? Either server 11 is not able to fetch changes from the master, or it is no longer able to update the 'updated' column to the last processed datalog_id.
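For reference, the pending set for a node is essentially every sys_datalog row whose datalog_id is newer than that node's 'updated' pointer and whose server_id is either the node's own id or 0 (broadcast). A minimal sketch of that selection logic in Python, using the numbers from the post above (this is an illustration of the mechanics, not ISPConfig's actual code):

```python
# Sketch of how pending jobs are selected for a node, based on the
# sys_datalog / server.updated mechanics described in this thread.
# Illustration only, not ISPConfig's actual implementation.

def pending_jobs(datalog_rows, node_id, updated):
    """Return datalog ids still pending for a node.

    A row is pending when its datalog_id is newer than the node's
    'updated' pointer and it targets this node directly or is a
    broadcast (server_id == 0).
    """
    return sorted(
        row["datalog_id"]
        for row in datalog_rows
        if row["datalog_id"] > updated
        and row["server_id"] in (0, node_id)
    )

# Rows taken from the post: six broadcast sys_ini changes.
rows = [{"datalog_id": i, "server_id": 0, "dbtable": "sys_ini"}
        for i in (94407, 94409, 94405, 94403, 94401, 94399)]

# Server 11 is stuck at updated=94351, so all six rows are pending.
print(pending_jobs(rows, node_id=11, updated=94351))
# → [94399, 94401, 94403, 94405, 94407, 94409]

# Server 1 is at updated=94411, so nothing is pending for it.
print(pending_jobs(rows, node_id=1, updated=94411))
# → []
```

This matches the symptom in the post: server 1 has caught up past the six rows, while server 11's pointer never advances.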
I've had a similar problem. It turned out node X had lost its connection to the ISPConfig master database, so jobs queued up on the master for node X while the other nodes did their work correctly. In short: verify that the database connections from all other nodes to the master node are operational. My guess is at least one is not.
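One quick first check is whether the node can still reach the master's MySQL port at the TCP level. A small sketch (the hostname below is a made-up placeholder; use the master host configured on the node, and note that a successful TCP connect does not prove the MySQL credentials still work):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical master hostname; replace with your master's address.
print(can_connect("master.example.com", 3306))
```

If the TCP connect succeeds but jobs still queue up, the next suspects are the MySQL grants for the node's user on the master.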
I guess we have to build a tool to diagnose such issues more easily and add an alert, e.g., when there has been no new monitoring data from a node for a certain amount of time.
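The core of such an alert would be a staleness check over the last-seen timestamp per node. A rough sketch of the idea (the threshold and the timestamps are made up; no such tool exists in ISPConfig today):

```python
from datetime import datetime, timedelta

def stale_nodes(last_seen, now, max_age=timedelta(minutes=10)):
    """Return node ids whose last monitoring update is older than max_age."""
    return sorted(nid for nid, ts in last_seen.items() if now - ts > max_age)

# Hypothetical last-seen monitoring timestamps per node id.
now = datetime(2024, 11, 18, 14, 48)
last_seen = {
    1: now - timedelta(minutes=2),   # reporting normally
    11: now - timedelta(hours=5),    # silent for hours -> should alert
}
print(stale_nodes(last_seen, now))   # → [11]
```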
Yes, thank you, you are correct. The error is on my side. For some reason my monitoring hadn't noticed that the replication had stalled (some WP options transient duplicate keys). Now it is catching up; two of the jobs are already gone.
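For anyone hitting the same thing: replication health on the mirror can be checked with `SHOW SLAVE STATUS\G` in the MySQL client; the fields to watch are Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master, and Last_SQL_Error. A small sketch that parses that output and decides whether replication is healthy (the sample text below is fabricated to resemble a replica stalled on a duplicate-key error, like mine was):

```python
def replication_healthy(status_text, max_lag=60):
    """Parse `SHOW SLAVE STATUS\\G` output and report replication health.

    Healthy means both replication threads run, lag is within
    max_lag seconds, and there is no pending SQL error.
    """
    fields = {}
    for line in status_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return (
        fields.get("Slave_IO_Running") == "Yes"
        and fields.get("Slave_SQL_Running") == "Yes"
        and fields.get("Seconds_Behind_Master", "NULL").isdigit()
        and int(fields["Seconds_Behind_Master"]) <= max_lag
        and fields.get("Last_SQL_Error", "") == ""
    )

# Made-up sample resembling a replica stalled on a duplicate key.
stalled = """\
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
        Seconds_Behind_Master: NULL
               Last_SQL_Error: Duplicate entry '_transient_x' for key 'option_name'
"""
print(replication_healthy(stalled))   # → False
```

Wiring something like this into the monitoring would have caught my stall long before the jobs piled up.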