Hey there,

I'm trying to debug some issues around certificate creation with LE on our mirror server nodes, which (recently?) broke.

Some basic information:
- CentOS 7
- Initially ISPC 3.1.15p3, now all updated to 3.2.1
- nginx + php-fpm 7.4
- One master, one mirror WEB server (switched to cluster this summer; before that we ran it on a single server)
- Tried Certbot (1.9.0) as well as acme.sh
- /etc/letsencrypt, /usr/local/ispconfig/interface/acme and /root/.acme.sh are NFS shares
- Skip LE verification is checked
- We have one site with several subdomains and aliasvhosts configured
- Cert requests and renewal work in general

I already followed the debugging guides and found no obvious errors within LE or the config.

For some time now (or maybe since we moved to the cluster setup?), creating new sites (vhost + aliasvhost) no longer creates the LE symlinks within /var/www/[domain]/ssl/[domain]-le.... I can see that the LE code within request_certificates is only executed on the master server, and as far as I understand the code, that is intended. The mirror server does NOT execute any LE code.

As far as I can tell, that request_certificates function, which only runs on a master server, is what creates the symlinks within [website_root]/ssl/ => /etc/letsencrypt/archive/...

If I create the three symlinks manually, LE SSL works perfectly fine on the mirror server as well, because the NFS share already syncs all the required cert and renewal config files.
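For reference, the manual workaround can be scripted. This is only a minimal sketch, not ISPConfig code: the function name `link_le_cert` and its parameters are made up for illustration, and the file-to-link mapping is taken from the `ln -s` lines in the master's debug log in this post. It assumes the `ssl` directory of the website already exists on the mirror.

```python
import os
from pathlib import Path
from typing import List

# Mapping as seen in the master's debug log: certbot file -> ISPConfig link suffix.
LE_LINKS = {
    "privkey.pem": "-le.key",
    "fullchain.pem": "-le.crt",
    "chain.pem": "-le.bundle",
}


def link_le_cert(domain: str, web_root: Path,
                 le_live: Path = Path("/etc/letsencrypt/live")) -> List[Path]:
    """Create any missing <domain>-le.* symlinks in <web_root>/ssl.

    Returns the links that were newly created. Hypothetical helper.
    """
    ssl_dir = web_root / "ssl"
    created = []
    for source_name, suffix in LE_LINKS.items():
        target = le_live / domain / source_name
        link = ssl_dir / f"{domain}{suffix}"
        if not target.exists():
            # The cert is not (yet) visible on this node, so do not link.
            raise FileNotFoundError(target)
        if link.is_symlink() or link.exists():
            continue  # leave existing links or files untouched
        os.symlink(target, link)
        created.append(link)
    return created
```

Called as `link_le_cert("example.com", Path("/var/www/clients/clientX/webY"))` on the mirror, it would create the same three links the master creates, provided /etc/letsencrypt is shared.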
Here are the related code parts I dug up so far:

Only run request_certificate on master node: https://git.ispconfig.org/ispconfig.../plugins-available/nginx_plugin.inc.php#L1378
Symlink creation within request_certificate: https://git.ispconfig.org/ispconfig...p/server/lib/classes/letsencrypt.inc.php#L507
And I found this patch intentionally disabling LE on mirror nodes: https://git.ispconfig.org/ispconfig/ispconfig3/-/commit/c1916502f99608148d56989e73250dde56e94d45

So this runs fine on the master server:

Code:
10.12.2020-00:17 - DEBUG - Calling function 'ssl' from plugin 'nginx_plugin' raised by event 'web_domain_update'.
10.12.2020-00:17 - DEBUG - Calling function 'update' from plugin 'nginx_plugin' raised by event 'web_domain_update'.
10.12.2020-00:17 - DEBUG - safe_exec cmd: chattr -i '/var/www/clients/clientX/webY' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: chattr +i '/var/www/clients/clientX/webY' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: df -T '/var/www/clients/clientX/webY'|awk 'END{print $2,$NF}' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: xfs_quota -x -c 'limit -u bsoft=0m bhard=0m webY' '/' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: xfs_quota -x -c 'timer -bir -i 604800' '/' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: chattr -i '/var/www/clients/clientX/webY' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: usermod --groups sshusers 'webY' 2>/dev/null - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: chattr +i '/var/www/clients/clientX/webY' - return code: 0
which: no acme.sh in (/usr/local/ispconfig/server/scripts)
which: no acme.sh in (/root/.acme.sh)
which: no letsencrypt in (/root/.local/share/letsencrypt/bin)
which: no certbot in (/opt/eff.org/certbot/venv/bin)
which: no letsencrypt in (/root/.local/share/letsencrypt/bin)
which: no certbot in (/opt/eff.org/certbot/venv/bin)
10.12.2020-00:17 - DEBUG - LE version is 1.9.0, so using certificates command
10.12.2020-00:17 - DEBUG - Create Let's Encrypt SSL Cert for: {{MAIN-SITE-DOMAIN}}
10.12.2020-00:17 - DEBUG - Let's Encrypt SSL Cert domains:
10.12.2020-00:17 - DEBUG - exec: /bin/letsencrypt certonly -n --text --agree-tos --expand --authenticator webroot --server https://acme-v02.api.letsencrypt.org/directory --rsa-key-size 4096 {{MAIN-SITE-DOMAIN}} {{SUBDOMAINS}} --webroot-path /usr/local/ispconfig/interface/acme
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator webroot, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Cert not yet due for renewal
Keeping the existing certificate
which: no letsencrypt in (/root/.local/share/letsencrypt/bin)
which: no certbot in (/opt/eff.org/certbot/venv/bin)
10.12.2020-00:17 - DEBUG - LE CERT OUTPUT: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
10.12.2020-00:17 - DEBUG - LE CERT OUTPUT: Found the following matching certs:
10.12.2020-00:17 - DEBUG - LE CERT OUTPUT: Certificate Name: {{MAIN-SITE-DOMAIN}}
10.12.2020-00:17 - DEBUG - LE CERT OUTPUT: Serial Number: XXX
10.12.2020-00:17 - DEBUG - LE CERT OUTPUT: Domains: {{MAIN-SITE-DOMAIN}} {{SUBDOMAINS}}
10.12.2020-00:17 - DEBUG - LE CERT OUTPUT: Expiry Date: 2021-03-09 22:15:20+00:00 (VALID: 89 days)
10.12.2020-00:17 - DEBUG - LE CERT OUTPUT: Certificate Path: /etc/letsencrypt/live/{{MAIN-SITE-DOMAIN}}/fullchain.pem
10.12.2020-00:17 - DEBUG - Found LE path: /etc/letsencrypt/live/{{MAIN-SITE-DOMAIN}}/fullchain.pem
10.12.2020-00:17 - DEBUG - Let's Encrypt Cert file: /etc/letsencrypt/live/{{MAIN-SITE-DOMAIN}}/fullchain.pem exists.
10.12.2020-00:17 - DEBUG - safe_exec cmd: ln -s '/etc/letsencrypt/live/{{MAIN-SITE-DOMAIN}}/privkey.pem' '/var/www/clients/clientX/webY/ssl/{{MAIN-SITE-DOMAIN}}-le.key' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: ln -s '/etc/letsencrypt/live/{{MAIN-SITE-DOMAIN}}/fullchain.pem' '/var/www/clients/clientX/webY/ssl/{{MAIN-SITE-DOMAIN}}-le.crt' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: ln -s '/etc/letsencrypt/live/{{MAIN-SITE-DOMAIN}}/chain.pem' '/var/www/clients/clientX/webY/ssl/{{MAIN-SITE-DOMAIN}}-le.bundle' - return code: 0
10.12.2020-00:17 - DEBUG - Enable SSL for: {{MAIN-SITE-DOMAIN}}

At the same time, this runs on the mirror server and finishes much earlier than on the master server, as there are no LE API calls:

Code:
10.12.2020-00:17 - DEBUG - Calling function 'ssl' from plugin 'nginx_plugin' raised by event 'web_domain_update'.
10.12.2020-00:17 - DEBUG - Calling function 'update' from plugin 'nginx_plugin' raised by event 'web_domain_update'.
10.12.2020-00:17 - DEBUG - safe_exec cmd: chattr -i '/var/www/clients/clientX/webY' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: chattr +i '/var/www/clients/clientX/webY' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: df -T '/var/www/clients/clientX/webY'|awk 'END{print $2,$NF}' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: xfs_quota -x -c 'limit -u bsoft=0m bhard=0m webY' '/' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: xfs_quota -x -c 'timer -bir -i 604800' '/' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: chattr -i '/var/www/clients/clientX/webY' - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: usermod --groups sshusers 'webY' 2>/dev/null - return code: 0
10.12.2020-00:17 - DEBUG - safe_exec cmd: chattr +i '/var/www/clients/clientX/webY' - return code: 0
10.12.2020-00:17 - DEBUG - SSL Disabled. {{MAIN-SITE-DOMAIN}}

Currently I cannot see any way (except maybe the cert renewal process, which does run on each server according to the letsencrypt.log) for the SSL symlinks to be created on the slave.

One side note: We tried acme.sh, and from what I can tell there is no way to make it work in a master/slave setup at the moment, because the script likewise only installs the cert file on the master server.

So: Am I doing something wrong? Am I missing something? Or is LE + ISPC Multiserver still a bit quirky? Or even unsupported?

From what I can tell right now, it seems a bit odd that the relevant code is excluded on the mirror node. But I believe there were reasons to do so back then. And maybe I also need some sleep.

Thanks a lot and best regards,
Jan
Hi Thom,

thanks for getting back to me. This would be the "easy" solution, I agree. But not a good one. PHP applications tend to perform poorly on NFS shares. I talked to some people at Zend about this, and they strongly advised against running PHP apps on NFS shares at all. We also evaluated this and ended up with horrible response times within the application. So we only keep the static assets and some file caches on NFS, which I would also assume to be best practice. The application itself is distributed to the web servers by deployment scripts.

Still, I don't believe that /var/www on NFS would fix the issue. The timing issue would remain, because the mirror node checks for the existence of the symlinks before the master finishes the LE request and before the symlinks are created. But true, the second time, after de-/activating SSL + LE, it should work because the symlinks are already there. So syncing the /ssl/ folder of a website might be a hacky workaround.

Anyway, after a good night's sleep, I believe only some minor changes are required to get LE to work in a multi-webserver environment without these workarounds. So this is my proposal for a solution:

Base requirement: /usr/local/ispconfig/interface/acme and /etc/letsencrypt (for certbot setups) or /root/.acme.sh (for acme.sh setups) are live-synced shares (NFS, Gluster, etc.)

The current issue: Requesting an LE certificate takes time. ISPC mirror servers just replicate; they do not know when the certificate is actually available. Multiple parallel LE requests will also not work due to possible race conditions. So it is OK to assume that enabling the "SSL + LE" flags in a website has to limit the cert-requesting servers to one. This is currently hard-coded to the master server, which is not ideal due to availability limitations, but OK for now. After the master has fetched the cert, there is currently no way to inform the mirror web servers about this.
The mirror servers have already processed their "SSL + LE" update by then: they couldn't locate the certificate symlinks within /ssl/, assumed a "No SSL" state for the website and fell back to "Disable SSL for Website". The missing (sym)links and the proper timing are the only two missing pieces I have identified so far. In addition to that, we have to distinguish "cert creation" and "cert renewal", as those are two separate processes.

Proposed solution:

#Initial creation:
Split the "request_certificate" (letsencrypt.inc.php) function into three independent steps (plus bootstrapping like certbot vs. acme.sh detection):
- Check for an existing cert and its metadata like serial or date ("check")
- Fetch the cert from LE using Certbot or acme.sh ("fetch")
- Update the website config in nginx or apache with SSL ("update")

In the webserver plugin (e.g. nginx_plugin.inc.php, apache_plugin.inc.php), the call of the letsencrypt module currently depends on the SSL config and the mirror status of the current server. A new flow could look like this:

##On the Master:
check, fetch, update (like the current flow)
-> On success: Write metadata (timestamp or the cert serial) to a new website config field (le_cert_timestamp / le_cert_serial) and replicate this to the mirror nodes together with the still enabled SSL + LE config.

##On Slave:
When data is replicated (SSL + LE enabled, new cert metadata) with a new timestamp or a new serial, compare it against the local config timestamp or local serial. If they mismatch, run the following Letsencrypt tasks: check, update

This should then recognize the existing cert in /etc/letsencrypt or /root/.acme.sh using the current commands for finding existing certs. The "update" phase should then create the necessary symlinks or copy over the cert files. This will run the locally relevant code parts at the right time on all the mirror servers and limit the remote cert fetch to the master server.
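To make the proposed flow concrete, here is a small decision sketch in Python (not PHP, and not actual ISPConfig code). `le_cert_serial` is the field name proposed above; `SiteState` and `plan_le_steps` are hypothetical names used only for illustration. It returns which of the "check" / "fetch" / "update" steps a node would run after a replicated change.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SiteState:
    """The subset of website config that matters here (hypothetical)."""
    ssl_letsencrypt: bool
    le_cert_serial: Optional[str] = None  # proposed new field, None = no cert yet


def plan_le_steps(is_mirror: bool, replicated: SiteState,
                  local: SiteState) -> List[str]:
    """Decide which Let's Encrypt steps this node should run."""
    if not replicated.ssl_letsencrypt:
        return []  # SSL + LE is not enabled, nothing to do
    if not is_mirror:
        # Master: unchanged behaviour, including the remote fetch.
        return ["check", "fetch", "update"]
    if replicated.le_cert_serial is None:
        return []  # master has not reported a certificate yet
    if replicated.le_cert_serial == local.le_cert_serial:
        return []  # this mirror already handled that certificate
    # Mirror with new cert metadata: no remote fetch, only local work.
    return ["check", "update"]
```

With this shape, the first replication (SSL + LE enabled, no serial yet) is a no-op on the mirror, and the second one (serial set by the master) triggers "check" and "update" exactly once.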
It should be sufficient to split the current request_certificate function into smaller functions and call them directly as required instead of making one monolithic call.

Regarding availability of the fetch process: Instead of limiting the fetch phase to the master, one could consider working with a lockfile in one of the shared folders. The first server that reaches the fetch step writes the lockfile and thereby prevents parallel fetches for this particular website. But this is a nice-to-have, so that a high-availability setup does not have to rely on one particular server.

#Renewal:
Now for the certbot renewal process: I was a bit surprised that this (900-letsencrypt.inc.php) actually runs on all servers. It is not limited to the master. Neither is acme.sh, as it runs at OS level and is triggered by cron. So only the server where the renewal happens reloads nginx or apache after the cert renewal is done. I think that's a bit odd?

So writing the same information about the new cert to the database, as I proposed above for the creation process, would also allow all servers to stay in sync and react to renewed LE certs.

What do you think of this approach?

Best,
Jan
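The lockfile idea from the availability paragraph could look roughly like the sketch below. `fetch_lock` and the lock file naming are hypothetical. It relies on exclusive file creation (`O_CREAT | O_EXCL`), which the Linux open(2) man page documents as supported on NFS only with NFSv3 or later on kernel 2.6 or later; other shared filesystems would need checking. Stale locks after a crash are not handled here.

```python
import os
import socket
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def fetch_lock(lock_dir: Path, domain: str):
    """Yield True if this node acquired the fetch lock for the domain.

    Yields False if another node already holds it. Hypothetical helper.
    """
    lock_path = lock_dir / f"{domain}.fetch.lock"
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        with os.fdopen(fd, "w") as handle:
            # Record the owner to help with manual cleanup of stale locks.
            handle.write(f"{socket.gethostname()}:{os.getpid()}\n")
        yield True
    finally:
        lock_path.unlink()
```

A node would wrap its "fetch" step in `with fetch_lock(shared_dir, domain) as acquired:` and skip the remote request when `acquired` is false.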
Would it work to just copy the certificate files to the other hosts from the master host that creates and renews certificates?
Currently that would not be enough, as the LE certs themselves remain in the certbot folder and the required, missing symlinks are only ever created on the master. But if the real cert files were copied to /ssl/ on all servers, it should work.

Thinking this one step further: if the content of the LE certs were stored in the (non-LE) SSL fields of the ISPC website config, ISPC should already sync the certs out as files to all the hosts, shouldn't it? So yes, if real files were stored in the DB and no symlinks were used, it should work. This might be another approach to handling it, and it would also be an argument for less custom handling of LE certificates compared to non-LE certs.

Is that what you were thinking about?
I think most of the pieces are there to make that possible (including update permission in the master db from the slaves), but I expect (haven't verified) there's logic to not write the certificate files if letsencrypt is in use, and something (probably a new column in web_domain?) on the master would need to be flagged to tell it to generate sys_datalog records for all the mirrored servers whenever that field is updated. I don't know that this is the best approach (I'm not "for" or "against" it), just filling in some pieces of the puzzle.
What's difficult as well is that the mirror doesn't know if or when the cert is in place, and therefore when to create the symlink. Retrying could lead to infinite retries.
I would still argue for a solution where certbot or acme.sh remains in control of managing the actual files instead of storing them in the ISPC DB. I love ISPC for being lightweight and flexible. I fear that parsing certs from a client like certbot or acme.sh and storing them might add complexity and reduce the flexibility of ISPC. The current cert flows would work for multiserver if the central code paths ("check" and "update") were simply also executed on the mirrors, as I suggested. But I am open to any working solution that improves the current situation.

Regarding the mirror not knowing when the cert is in place: I don't see any real issue there. The node running certbot / acme.sh knows exactly when that is the case. So the right moment is when the master node has finished and enables SSL for the domain. That's why I wrote that a sync event has to be re-triggered then. I think "writing a datalog" is the appropriate term in ISPC speak, as Jesse wrote?

With this approach, there is no guessing and no possible race condition. Also no blocking on the mirrors. So I would prefer a "push" solution over a "pull" solution where the slave tries to guess the timing.
@Jesse Norell @Th0m @Taleman Could you give me some hints on how to programmatically retrigger the (vhost / ssl) actions on the mirror servers? I could then propose a first PR draft so we can continue the discussion based on some code. Thanks guys!
Which actions do you mean exactly? Best thing to do is go through the plugins responsible for the actions you mean and see which happens where and why.
@Th0m The one part I haven't figured out yet by checking the plugin code is how to programmatically re-invoke, for example, the nginx_plugin on mirror nodes. I would like to programmatically retrigger the SSL actions on the mirror servers AFTER the SSL Letsencrypt action was successful on the master. As far as I understand, to make this work a datalog entry has to be created, which would then be synced and executed on all servers.

To be more precise: I would like the same things to happen as when I manually disable and re-enable the "SSL + LE" checkboxes in the frontend. I don't think this is done anywhere in the vhost / ssl plugins at the moment. So a hint from someone who knows the codebase would save me a lot of trial and error.
I posted about LE renewal hooks, adding a copy there would be possible. https://www.howtoforge.com/community/threads/lets-encrypt-lots-of-errors-in-standalone.79363/ I do not know which hooks can be added to ISPConfig actions.
Thanks for this suggestion. But this would only work for certbot, wouldn't it? The acme.sh (and other implementations) would not work with this. Or would they?
This can be done on the master only, as only the master can write new records in the datalog table. What you might do is add some code to the ssl function of the nginx and apache plugins that checks whether the LE cert is already there (only on mirror servers, of course) and, if not, sleeps for another 10 seconds or so and checks again. You should limit this to e.g. 2-5 minutes though, and then skip SSL and log a failure, to ensure that we don't end up with endlessly waiting zombie processes.
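The bounded wait described here can be sketched as follows, in Python for illustration only (the real change would live in the PHP plugins). `wait_for_cert` and its parameters are hypothetical; the defaults follow the 10 second interval and the 2-5 minute limit mentioned above. `sleep` and `clock` are injectable only so the loop can be tested without real waiting.

```python
import time
from pathlib import Path
from typing import Callable


def wait_for_cert(cert_path: Path, timeout: float = 180.0,
                  interval: float = 10.0,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until cert_path exists or the timeout has elapsed.

    Returns True if the certificate showed up, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if cert_path.exists():
            return True
        if clock() >= deadline:
            return False  # caller should skip SSL and log a failure
        sleep(interval)
```

On a mirror, the ssl function would call this once before deciding between "Enable SSL" and "SSL Disabled", so the wait is always capped by `timeout`.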
Thanks @till ! Just to get this right: The "master" as in the "ISPConfig Master Server" which also runs the interface? And not the Master as in the Master -> Mirror relationship. Right? If so, I would prepare some code using the sleep method you suggested. Although I personally dislike blocking approaches. Yet it still might be the lesser evil.
Yes, for security reasons the datalog can only be written from the interface server. It's not ideal indeed, but opening up the datalog to slave nodes would be really bad security-wise, as it would allow an attacker who manages to hack one slave node to potentially take over all other nodes.