Hi, I've just updated our multiserver setup this morning, from 3.2.9p1 to 3.2.11p1, and a short while later my nameservers started to act up and denied all queries. But let me start with the setup itself and what I did.

The multiserver setup: panel, mail01, mail02 (mirror), ns01, ns02 (mirror), db01, web01, web02

Update path:
- Activated maintenance mode
- Updated the panel: ispconfig_update.sh stable
  - Reconfigure Permissions: no
  - Reconfigure Services: yes
  - Create new SSL Certs: no
  - Reconfigure Crontab: yes
- Updated all slaves (mains before mirrors): ispconfig_update.sh stable
  - Reconfigure Permissions in master: no (only yes on the last slave, which was ns02)
  - Reconfigure Services: yes
  - Create new Certs: no
  - Reconfigure Crontab: yes
- Disabled maintenance mode

I checked services afterwards, especially bind9, as this is the only part that is running in production right now. The service was OK, and after checking 2-3 test records with a quick dig @ns01/ns02, all was good.

Around 40-60 minutes later the first calls came in that things were acting up and services were not working. So we checked the nameserver and, surprise surprise:

Code:
query (cache) 'XXXXXXX/A/IN' denied

For all queries I got a denied message in the log, which I could validate with dig. Checking the nameservers, all zone files were still under /etc/bind/. However, named.conf.local only listed 2 zones. Checking the mirror: again, only 3-4 zones listed. As you can imagine, on a production system with around 1200 zones, panic set in.

Not knowing what happened, I started to use the resync tool and the job queue filled with 10k+ changes (9k record changes, 1k zone changes). However, it handled only 3-4 jobs every 3 seconds. That was too slow, so I decided to restore ns01 to the state before the update, which took ~2 minutes. The server was back up and DNS queries got answered. The jobs were still in the queue, but now processing at a rate of 150 jobs/sec. After the job queue was finished, I checked the servers. Everything is OK. ns01 is now on version 3.2.9p1, everything else is on 3.2.11p1.

I'm scared; I should not have done updates without a maintenance window, and not on a Friday, ffs. I didn't realize at the time of the incident that I could have just grabbed named.conf.local from /var/backup and would have had at least a working nameserver again.

Which brings me to the question: what just happened? Any ideas?
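For reference, this is roughly how I checked the zones and where the backup would have been. Hostnames and the test record are placeholders, the pri.* zone file naming is how it looks on my servers, and the exact layout under /var/backup may differ on other setups:

Code:
# compare zone declarations actually loaded vs. zone files on disk
grep -c '^zone ' /etc/bind/named.conf.local
ls /etc/bind/pri.* | wc -l

# spot-check a record against both nameservers (placeholder names)
dig @ns01 somehost.example.com A +short
dig @ns02 somehost.example.com A +short

# locate the pre-update copy of named.conf.local (exact path/name may differ)
find /var/backup -name 'named.conf.local*' -ls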
Never deploy on a Friday - that goes for updates as well. You might want to try the update again and let it reconfigure database permissions - there have been some table changes around DNS, and I have seen trouble that was resolved by reconfiguring database permissions. To dive further into it, we would need to run the server.sh cron with debug mode on after you update. If you want, I'd be willing to look over your shoulder while upgrading the servers so we can act quickly and resolve issues instead of rolling back (but roll back if required).
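Roughly, the debug run looks like this on a default install (first set the Log Level to Debug for the affected server under System > Server Config in the panel, and temporarily disable the server.sh cron job so runs don't overlap):

Code:
# on the slave (e.g. ns01), after switching its log level to Debug in the panel:
crontab -e    # comment out the server.sh line for the duration of the test
/usr/local/ispconfig/server/server.sh    # run the cron manually as root and watch the output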
I've already talked to Till and checked some stuff, but I'm still a bit stumped about what happened. I may come back to this offer sometime next week, if that's OK with you.