ISPConfig update issue - nameserver not working

Discussion in 'ISPConfig 3 Priority Support' started by pyte, Nov 24, 2023.

  1. pyte

    pyte Well-Known Member HowtoForge Supporter

    Hi,
    I updated our multiserver setup this morning from 3.2.9p1 to 3.2.11p1, and a short while later my nameservers started to act up and denied all queries. But let me start with the setup itself and what I did.

    The multiserver setup: panel, mail01, mail02(mirror), ns01, ns02(mirror), db01, web01, web02
    Update path:
    • Activated maintenance mode
    • Updated the panel
    1. ispconfig_update.sh
    2. stable
    3. Reconfigure Permissions: no
    4. Reconfigure Services: yes
    5. Create new SSL Certs: no
    6. Reconfigure Crontab: yes
    • Updated all slaves (mains before mirrors)
    1. ispconfig_update.sh
    2. stable
    3. Reconfigure Permissions in master: no (only yes on the last slave, which was ns02)
    4. Reconfigure Services: yes
    5. Create new SSL Certs: no
    6. Reconfigure Crontab: yes
    • Disable maintenance mode
    I checked the services afterwards, especially bind9, as this is the only part that is running in production right now. The service was OK, and after checking 2-3 test records with a quick dig @ns01/ns02, all was good.
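    For anyone wanting to repeat that quick check in a loop, here is a minimal sketch. The nameserver and record names are placeholders, not from my setup; it simply shells out to dig and reads the status field from the response header:

    ```python
    import re
    import subprocess

    def dig_status(dig_output: str) -> str:
        """Extract the status (NOERROR, REFUSED, SERVFAIL, ...) from dig's header line."""
        m = re.search(r"status: ([A-Z]+)", dig_output)
        return m.group(1) if m else "UNKNOWN"

    def check_record(nameserver: str, name: str) -> str:
        """Query a specific nameserver for an A record and return the response status."""
        out = subprocess.run(["dig", "@" + nameserver, name, "A"],
                             capture_output=True, text=True).stdout
        return dig_status(out)

    # Example (placeholder names -- substitute your own records):
    # for ns in ("ns01.example.com", "ns02.example.com"):
    #     print(ns, check_record(ns, "www.example.com"))
    ```

    A healthy authoritative answer shows status NOERROR; the denied queries below showed up as REFUSED.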
    Around 40-60 minutes later the first calls came in that things were acting up and services were not working. So we checked the nameservers and, surprise surprise:
    Code:
    query (cache) 'XXXXXXX/A/IN' denied
    For all queries I got a denied message in the log, which I could validate with dig. Checking the nameservers, all zone files were still under /etc/bind/. However, named.conf.local only listed 2 zones. Checking the mirror: again, only 3-4 zones listed. As you can imagine, on a production system with around 1200 zones, panic set in.
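    A quick way to spot this kind of mismatch is to compare the zones declared in named.conf.local against the zone files on disk. A minimal sketch (it assumes the pri. file-name prefix ISPConfig uses for its zone files on my systems; adjust paths and prefix if yours differ):

    ```python
    import re
    from pathlib import Path

    def declared_zones(named_conf_text: str) -> set:
        """Zone names declared via zone "..." { ... }; statements."""
        return set(re.findall(r'zone\s+"([^"]+)"', named_conf_text))

    def zone_files(bind_dir: str = "/etc/bind", prefix: str = "pri.") -> set:
        """Zone names inferred from files like /etc/bind/pri.example.com."""
        return {p.name[len(prefix):] for p in Path(bind_dir).glob(prefix + "*")}

    # Zones that have a file on disk but are no longer declared:
    # conf = Path("/etc/bind/named.conf.local").read_text()
    # missing = zone_files() - declared_zones(conf)
    # print(len(missing), "undeclared zones:", sorted(missing)[:10])
    ```

    On a box in the state described above, the "missing" set would have held roughly 1200 entries while declared_zones returned only a handful.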

    Not knowing what happened, I started to use the resync tool, and the job queue filled with 10k+ changes (9k record changes, 1k zone changes). However, it handled only 3-4 jobs every 3 seconds. This was too slow, so I decided to restore ns01 to the state before the update, which took ~2 minutes. The server was back up and DNS queries were answered again.
    The jobs were still in the queue, but now processed at a rate of 150 jobs/sec. After the job queue was finished, I checked the servers. Everything is OK. ns01 is now on version 3.2.9p1, everything else on 3.2.11p1. I'm scared; I should not have done updates without a maintenance window, and not on a Friday, ffs.
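    For a sense of scale, the difference between the two processing rates works out like this (job counts are the rounded numbers from above):

    ```python
    def drain_time_minutes(jobs: int, jobs_per_second: float) -> float:
        """Minutes needed to drain a queue of `jobs` at a given rate."""
        return jobs / jobs_per_second / 60

    slow = drain_time_minutes(10_000, 3.5 / 3)   # ~3-4 jobs every 3 seconds
    fast = drain_time_minutes(10_000, 150)       # rate observed after the restore
    print(f"slow: ~{slow:.0f} min, fast: ~{fast:.1f} min")  # -> slow: ~143 min, fast: ~1.1 min
    ```

    So at the original rate the resync would have kept the nameserver broken for well over two hours, versus about a minute after the rollback.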

    I didn't realize at the time of the incident that I could have just grabbed named.conf.local from /var/backup and would have had at least a working nameserver.
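    In case it helps anyone in the same spot: the exact directory layout under /var/backup can differ between versions, so as a sketch, something like this finds the most recent copy of a file anywhere under the backup root:

    ```python
    from pathlib import Path
    from typing import Optional

    def latest_backup(filename: str, backup_root: str = "/var/backup") -> Optional[Path]:
        """Return the most recently modified copy of `filename` under backup_root."""
        candidates = list(Path(backup_root).rglob(filename))
        return max(candidates, key=lambda p: p.stat().st_mtime, default=None)

    # found = latest_backup("named.conf.local")
    # if found:
    #     print("restore candidate:", found)
    ```

    Always inspect the candidate before copying it back over a live config, of course.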

    Which brings me to the question: what just happened? Any ideas?
     
  2. Th0m

    Th0m ISPConfig Developer Staff Member ISPConfig Developer

    Never deploy on a Friday - that goes for updates as well ;)

    You might want to try updating again and letting it reconfigure database permissions - there have been some table changes around DNS, and I have seen trouble that was resolved by reconfiguring database permissions.

    To dive further into it, we would need to run the server.sh cron with debug mode on after you update. If you want, I'd be willing to look over your shoulder while upgrading the servers so we can act quickly and resolve it instead of rolling back (but roll back if required).
     
  3. pyte

    pyte Well-Known Member HowtoForge Supporter

    I've already talked to till and checked some things, but I'm still a bit stumped about what happened.

    I may come back to this offer sometime next week, if that's OK with you.
     
    Th0m likes this.
