Too many fstab mounts causing network name resolution timeout

Discussion in 'General' started by trancenoid, Mar 6, 2023.

  1. trancenoid

    trancenoid New Member

    Hello Everyone,

    I am migrating a client's old servers into a new replicated cluster. The migration itself has no issues, and neither does ISPConfig, however two seemingly unrelated processes start conflicting after migration:

    Setup: I am trying to migrate 6 servers (A-F) into one cluster (2 VMs in Azure, active-active replication). In total there are about 7K websites on those 6 servers, and each website creates an fstab entry for its /var/www/<domain> folder, as ISPConfig normally does, after migrating with the Migration Tool. I migrate the servers one by one; after each migration I resync the cluster and reboot before I do the next one.
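
    For reference, the entries in question are the per-site bind mounts that ISPConfig writes to /etc/fstab; each website adds roughly one line like the one below (the domain and web folder are placeholders, and the exact mount options may differ between installs):

    Code:
        # one bind mount per website in /etc/fstab (example paths, options may vary)
        /var/log/ispconfig/httpd/example.com /var/www/clients/client1/web1/log none bind,nofail 0 0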

    Problem: 5 servers (~5K websites, A-E) are migrated and reboot with no problems, but after the last one (~2K websites) the target VMs cannot complete a reboot (the migration and resync themselves succeed). The startup gets stuck on "Network Name Resolution" (systemd-resolved) and keeps retrying to start it indefinitely. Since SSH access to the VM depends on this service coming up, I cannot log in to it.

    Details:
    1. The VMs have 8 vCores, 16 GB RAM and Ubuntu 20.04, and ISPConfig is set up with Apache. After migrating all the servers, about 30% of the disk is still free (~600 GB).
    2. The last server, F, works fine when migrated alone onto a fresh cluster, so there is no issue with any of the source servers.
    3. If I remove the fstab entries (by attaching the failed OS disk to another machine, editing it there, and reattaching it to the cluster), the cluster VMs boot up fine, and I later bind-mount the directories manually with mount (see the sketch after this list). But this is not a good fix in my opinion.
    4. The boot time also increases as the number of fstab entries grows. Once the entries cross about 5K, network name resolution starts to fail, leading to a failed startup. I have yet to determine what the link between these two is; maybe an OS expert can highlight it. I have noted the boot (service startup) time for various numbers of fstab entries:
      1. With 20 mounts: boots (mount + name resolution + other services) under 30s
      2. With 1000 mounts: 44s
      3. With 2000 mounts: 1:05min
      4. With 4000 mounts: 2:32min
      5. 5000 mounts: boots up in 3:52; resolved starts at 2:57 and says it will time out at 3:04
      6. >5500 mounts: fails to boot, stuck at network name resolution (systemd-resolved)
    5. In all these tests I have kept the OS / network interface / firewall / DNS / domain settings etc. the same, so there is likely no external factor responsible.
    6. CPU, RAM and network usage never get anywhere near saturation (they stay well below 100%) during boot, so I do not think a slow system is the cause of the problem.
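
    For completeness, this is roughly how I bind-mount everything by hand after such a boot and how I check where the startup time goes (the path of the saved fstab copy is just an example):

    Code:
        # mount all entries from a saved copy of the full fstab (path is an example)
        sudo mount -a -T /root/fstab.full

        # show which units consumed the boot time and resolved's place in the chain
        systemd-analyze blame | head -n 20
        systemd-analyze critical-chain systemd-resolved.service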
    Queries:
    1. I traced the creation of the entries to ISPConfig's Apache plugin, /usr/local/ispconfig/server/plugins-available/apache2_plugin.inc.php. Is this the right place to patch the problem, and what is the best way to do it? I am looking into modifying it to create a startup script instead of fstab entries to mount the directories (a rough sketch of the idea follows this list), but any suggestion on this would be much appreciated.
    2. What could be the link between the two? It could be something specific to Azure, but since there is no documentation on this and no similar forum posts, it is difficult to pinpoint.
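
    The idea I have in mind is roughly the following: keep the bind mounts out of /etc/fstab and mount them from a one-shot service after boot, so systemd does not treat several thousand mounts as part of the boot transaction. The unit name, script path and whatever it would mount are made up for illustration; the plugin would have to generate them:

    Code:
        # /etc/systemd/system/ispconfig-bind-mounts.service (hypothetical unit name)
        [Unit]
        Description=Mount ISPConfig per-site bind mounts after boot
        After=local-fs.target

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        # placeholder script that would run the per-site mount commands
        ExecStart=/usr/local/ispconfig/server/scripts/mount_web_folders.sh

        [Install]
        WantedBy=multi-user.target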
    Please let me know any additional details / logs that are required, any help is much appreciated.

    Regards,

    Shivansh
     
  2. till

    till Super Moderator Staff Member ISPConfig Developer

    Yes

    I have not heard of such an issue yet, but there are probably not many systems with 7000 sites on a single node. The core of the issue is likely that services have dependencies on each other when systemd starts them: creating that many mounts simply takes too long, which then makes other services wait for them. So there is probably no direct relation between the DNS failure and the bind mounts; one service waits for another one to complete, and then maybe a third service complains about the second one, but the second one (in this case DNS) is not at fault, as it simply has to wait for the first one to complete.
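
    If the waiting really comes from the mounts, one thing that might be worth testing is to mark the fstab entries so that systemd does not treat them as hard boot dependencies, e.g. with standard fstab/systemd options like these (example entry, untested with thousands of bind mounts):

    Code:
        # nofail: do not delay or fail the boot for this mount
        # noauto,x-systemd.automount: only mount it when the path is first accessed
        /var/log/ispconfig/httpd/example.com /var/www/clients/client1/web1/log none bind,nofail,noauto,x-systemd.automount 0 0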
     
