Error let's encrypt renewing cert

Discussion in 'Installation/Configuration' started by Darkmagister, Mar 8, 2022.

  1. Darkmagister

    Darkmagister New Member

    Hello, I have ISPConfig 3 installed on Debian. One of the three domains on the server, the one with the auto www subdomain, has a problem renewing its Let's Encrypt cert: the acme log shows no errors, but the .key file is empty and no virtual host for port 443 is created. I tried to issue a new cert manually with acme.sh, but I don't know it very well and couldn't create a correct certificate, and now I have also reached the rate limit. Is it possible to get the acme.sh command that ISPConfig executes, or at least can I tell ISPConfig to create a new key file?
    Any other ideas would be helpful. I could create a new cert for *.domain.com, and since that counts as another domain it would bypass the rate limit, but without the correct acme command I don't know what to do.

    thanks
     
  2. till

    till Super Moderator Staff Member ISPConfig Developer

    Do not manually use acme.sh command to create certs for websites on ISPConfig systems. Use the Let's encrypt error FAQ to narrow down the reason for your issue step by step.

    https://www.howtoforge.com/community/threads/lets-encrypt-error-faq.74179/

    acme.sh should handle that on its own. But you can use the acme.sh command to delete the cert, incl. its key, if it has been damaged.

    You can't create wildcard certs through ISPConfig, and not easily by hand either, as LE will not issue wildcard certs through HTTP-based domain validation (they require DNS validation). Wait until the rate limit has expired.
     
  3. I was going to start a new thread with this exact same issue, but I see now that I'm not the only one with the same problem, and, AFAIK, it's not a very easy one to solve.
    Configuration recap: ISPConfig 3.2.7p1 running over Ubuntu 20.04.4 LTS (should be similar enough to the OP's Debian install), self-compiled nginx (to get access to more modern goodies, such as HTTP/3) using nginx-autoinstall, PHP 7.4 + 8.0 + 8.1 as selectable PHP versions (7.4 currently the default), latest MariaDB from the default repositories, and using acme.sh (which I manually self-update from time to time to the latest version). Almost everything in my configuration is up to date with the standard packages from 20.04.4, although in some cases (PHP is a good example!), since Ubuntu LTS tends to trail behind contemporary versions, I use non-standard, popular PPAs.
    However, I believe that the issues are not related to the installation at all, but rather to a conceptual issue which is not so easy to 'fix'.
    Also note that, to add another layer of complexity, I run Cloudflare on top of all my websites. Occasionally, however, I wish to turn Cloudflare off just to see if my server is responding correctly, i.e. to rule out Cloudflare as a source of potential problems (it rarely is, though), and that's why all domains (except one, which I don't have any access to) also have Let's Encrypt certificates. For those who don't know: if you use Cloudflare to protect/cache your websites, you do not need 'real' certificates on your server to use Cloudflare's HTTPS service. Cloudflare will generate a free origin certificate for you, signed by Cloudflare's own CA rather than a publicly trusted one, which will only work behind Cloudflare, not on the open Internet. That is a good option if you don't expect to ever turn Cloudflare off.
    Ok, so here is the issue. Let's start assuming that everything works fine on the first time that you use ISPConfig 3 to 'talk' to Let's Encrypt via acme.sh. A brand new certificate is correctly issued, stored at the correct place where it's supposed to be, nginx gets restarted, loads the new certificate correctly, ISPConfig 3 verifies that everything is running as it should, and activates whatever it needs on its internal logic to add the proper lines for SSL configuration. The backoffice correctly shows both SSL and Let's Encrypt SSL being active, as expected.
    Everything works fine until two months elapse (that's the default timeframe). The acme.sh daily check for expiry dates will run and notice that it should now ask for a reissue of the certificate, so it runs whatever commands it needs to request it from Let's Encrypt.
    Now let's suppose that this fails for some reason.
    There are a lot of issues that can happen, beyond Till's FAQ, which have nothing to do with the configuration. It might just be lack of disk space. There may be a directory with wrong permissions, or that mistakenly was changed to the wrong user. The Internet connection might be down, or unreliable, at least between your server and Let's Encrypt, and, since the re-issue procedure is not instantaneous, there might be a connection timeout at some stage. Or your server may be hit by happy hackers and is being dragged down to a snail's pace — meaning that whatever reasonable timeouts have been programmed into acme.sh are not enough to deal with such spikes in CPU loads.
    Under normal circumstances, the above is not a problem. After all, acme.sh starts its re-issue procedure one month before the expiry date — so it's ok if it fails one day or two. Or even a week. That's the whole point of re-issuing the certificate well before the actual expiry date. So far, so good.
    Now, for those unpredictable network/disk/CPU errors/surges/failures, which are usually temporary (even if they last hours or days!), this approach works well. I mean, if your server is under attack for a whole month so that it cannot do anything sensible, you have far more issues than worrying about Let's Encrypt timeouts! acme.sh is quite generous in giving you time to fix things, and it will patiently try over and over again, day after day, until eventually it manages to contact Let's Encrypt and get a re-issued certificate. Nothing to worry about!
    But what about systematic errors? In other words: imagine the simple scenario of having a directory somewhere with the wrong user and/or permissions, which acme.sh needs to complete the procedure. Or, say, you have a failing nginx vhost, which, under normal circumstances, you haven't even noticed, because it's an obscure vhost doing little (so nobody complained so far). When acme.sh actually reconfigures nginx to conform to the ACME specs, nginx may fail to start, acme.sh detects this, and reverts the configuration so that nginx starts again. But when this happens, naturally, the certificate does not get re-issued. While acme.sh will try on subsequent days, if the condition persists, it will always fail, day after day...
    Until the certificate actually expires.
    At that point, the sysadmin will finally realise something is wrong, since that site will now show an invalid certificate and be effectively shut off from the Web. They might even notice that the issue is not related to acme.sh at all, but rather to some obscure problem with, say, the filesystem setup, or nginx, or whatever it is (hint: on Un*x, in 90% of cases, it's a permissions error; the remaining 10% is almost always a missing semicolon somewhere, or a misspelled word :cool:).
    Once that stupid (but systematic!) mistake is corrected, the sysadmin attempts to get acme.sh to run again. But now they face a Catch-22: to actually be able to validate again, according to the ACME protocol, acme.sh needs SSL to be operational and enabled. But because SSL requires a valid certificate, nginx won't start. That means acme.sh has to abort and revert to the previous nginx configuration, which in turn means loading the expired certificate again. No matter what you do, there is no obvious way to get this to ever work again.
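One way to notice a systematic failure before the expiry date is to watch how far into the renewal window a certificate has drifted. Here is a minimal sketch, assuming the one-month renewal window described above; the function name, the status labels and the 20-day warning threshold are my own choices for illustration, not anything acme.sh or ISPConfig provides.

```python
import ssl
from datetime import datetime, timedelta, timezone


def renewal_status(not_after, now, renew_days=30, warn_days=20):
    """Classify a certificate by how close it is to its notAfter date.

    not_after is the OpenSSL-style text date, e.g. "Jun  8 12:00:00 2022 GMT".
    now must be a timezone-aware datetime.
    renew_days and warn_days are assumptions, not acme.sh settings.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    left = expires - now
    if left <= timedelta(0):
        return "expired"
    if left < timedelta(days=warn_days):
        # Daily retries have apparently been failing for a while.
        return "renewal-failing"
    if left < timedelta(days=renew_days):
        return "renewal-due"
    return "ok"


if __name__ == "__main__":
    print(renewal_status("Jun  8 12:00:00 2022 GMT", datetime.now(timezone.utc)))
```

A certificate that sits in "renewal-failing" is the signal to go looking for the permissions error or the broken vhost while the old certificate is still valid.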
    ISPConfig 3 actually has some built-in logic to address this issue: namely, when requesting a new certificate (or revalidating an existing one), it allows plaintext HTTP connections, limited to only the ACME-related configuration files. It's a reasonably safe bet — sure, you won't have an encrypted connection to handle the Let's Encrypt (re-)validation process, but since only encrypted files will be returned that way, it's safe to assume that a very short-lived security 'hole' punched through port 80 will not be a major security issue.
    However, the problem is that, these days, we tend to configure everything to prevent any communications to port 80. Naturally, your mileage may vary, but here is what I do:
    1. ISPConfig redirects all HTTP requests to HTTPS (this is done on two different backoffice panels) via nginx configuration.
    2. The Linux firewall blocks all attempts to connect to port 80 (which do not come from my 'secure' network, which only uses private addressing, not exposed to the Internet).
    3. Additionally, I also configure my provider's firewall to do the same.
    4. On top of everything, Cloudflare not only redirects HTTP to HTTPS (again, for some reason, there are two different panels on their backoffice for doing so...), but I use what they call a strict configuration, which means that only HTTPS traffic will go through their services to my server (technically, this is overkill paranoia, but I nevertheless do it).
    5. And, naturally enough, I have HSTS turned on, which means that any HTTP client (browser or otherwise) which had used HTTPS before will refuse to now connect via plaintext HTTP (actually, I have even a few more rules on nginx and Cloudflare to refuse service in the case that the type of connection changes — excluding the certificate's expiry date, which is allowed to change — to robustly rule out the more common man-in-the-middle attacks).
    6. I use DNS with DNSSEC for almost all domains, and, naturally enough, host all DNS zones on Cloudflare — not on my own servers! — since that's how Cloudflare works (and Cloudflare is also configured to properly detect that only Let's Encrypt is allowed to emit valid certificates for my domains, and to reject any other provider attempting to do the same — so, no, you can't just steal my certificates and set up your own server with my domains; Cloudflare will reject requests to such a hijacked server).
    There might be one or two additional layers on top of all the above, which I might have forgotten to enumerate, but the point is simply that either HTTPS is working with a valid Let's Encrypt certificate, or my setup will stop working, and it's anything but trivial to 'override' it or provide a loophole/backdoor on the setup for 'emergency purposes' (hint: malicious professional hackers — not script kiddies! — will always figure out such loopholes... they're experts at detecting them, after all!).
    In other words: if, for some reason, acme.sh fails to revalidate the certificate automatically, and the LE certificate hasn't been re-validated by the expiry date, it will never be able to re-issue that certificate automatically. In fact, the system will even prevent acme.sh from manually re-validating that certificate (or even issuing a brand new certificate from scratch) for the domain that expired.
    To make things even harder, as the OP has explained so well, when things go seriously wrong, acme.sh, for some unexplained reason, leaves the key and certificate at zero bytes. I'm pretty sure that this did not happen under certbot; when things failed, you'd just be stuck with the old (but invalid) certificate. Under the current setup, however, the default behaviour seems to be to truncate the sensitive files first, request new ones from Let's Encrypt, and write them over the old ones. I don't know, perhaps this is meant as a security measure, and it's done so by design. It might be an attempt to prevent having corrupted, partial files on disk; either the re-validation process completes atomically, or the file will remain with zero bytes. Or it might just be something more convoluted in acme.sh's code that makes it behave that way.
    The point here is that, upstream, ISPConfig will think that there is an updated certificate on disk (the date will have changed, after all), but, because it has zero bytes, nginx will fail to load it, with a cryptic error such as 'invalid certificate format' (sure thing, having zero bytes is most definitely an invalid format!), so acme.sh will have no option but to restore the configuration. This time, though, it will retain the zero-byte files.
    In other words, the catch-22 has now become even more serious: acme.sh not only failed to re-validate a certificate, but it will prevent any further attempts from succeeding by truncating the sensitive files to zero bytes while changing their dates, and that is true for both manual and automatic validation. Whatever tools are running to check the integrity of the system will find those two zero-byte files with correct permissions and a changed date reflecting the last time acme.sh ran. All seems to be fine, so nothing will get flagged as suspicious (zero-byte files are perfectly reasonable to have in many circumstances).
    You can restore the files from backups, of course, but you'll still be stuck with expired certificates, so the whole process will remain stuck. And, as the OP noticed so well, insisting on revalidating certificates manually (which will always fail) will eventually hit Let's Encrypt's rate limit for unsuccessful attempts, and you'll be graylisted for a day.
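Since generic integrity checkers won't flag those files, a small dedicated check can. The following is a minimal sketch that walks a certificate directory and reports zero-byte key/certificate files; the list of file suffixes and the example directory are assumptions and should be adapted to wherever your acme.sh installation keeps its files.

```python
from pathlib import Path

# Assumption: suffixes typically used for keys and certificates.
CERT_SUFFIXES = {".key", ".cer", ".crt", ".pem"}


def find_empty_cert_files(root):
    """Return all zero-byte key/certificate files below root, sorted."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in CERT_SUFFIXES and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # Hypothetical location; use your own certificate directory here.
    for path in find_empty_cert_files("/etc/letsencrypt"):
        print(f"zero-byte file: {path}")
```

Run from cron, this at least turns the silent truncation into something you get told about.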

    You may then ask how I manage to get my sites still running after three months!?

    Well, I cheat.

    Because ACME-based re-validation or re-issue (strictly speaking, HTTP-based validation; DNS-based validation is part of the ACME protocol too) will be effectively broken forever once the first certificate expires with a failed re-validation, obviously, I cannot resort to it. (Due to the way nginx is configured, all vhosts must be correctly set up: if one vhost fails, nginx fails for every vhost. That's why both acme.sh and ISPConfig are so careful to restore a previously working configuration in case one change would break everything.) Fortunately, Let's Encrypt has other methods for issuing and revalidating certificates. I use DNS-based validation, since that will always work, irrespective of nginx's current status and of whatever firewall configurations and levels of security I have set up for Web services. Also, because my DNS is actually set up at Cloudflare, and not on my own server, it's far less likely that I will break my DNS configuration by mistake with a 'wrong' script.
    acme.sh, for that purpose, can work in two ways. The first — reminiscent of how certbot originally worked with DNS-based validation — is doing a manual request. This is cumbersome and cannot be automated: it requires a human to set up a temporary TXT entry on the DNS zone, request Let's Encrypt to check it, and then run the re-validation (or new issue) manually again — and remove that TXT entry afterwards (it can only be used once, anyway). If you have a bunch of domains and subdomains to validate, it takes an eternity.
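For reference, this is what that temporary TXT entry looks like under the ACME DNS-01 challenge: the record lives at _acme-challenge.<domain>, and its value is the unpadded base64url encoding of the SHA-256 digest of the key authorization. A minimal sketch follows; the function name is mine, and in practice acme.sh prints the value for you, so this is only to show what is going on.

```python
import base64
import hashlib


def dns01_txt_record(domain, key_authorization):
    """Return (record_name, record_value) for a DNS-01 challenge.

    key_authorization is the challenge token joined with the account
    key thumbprint ("<token>.<thumbprint>"), as supplied by the client.
    """
    if domain.startswith("*."):
        # A wildcard name is validated at the base domain.
        domain = domain[2:]
    name = "_acme-challenge." + domain.rstrip(".")
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return name, value


if __name__ == "__main__":
    # Placeholder key authorization, for illustration only.
    print(dns01_txt_record("*.example.com", "token.thumbprint"))
```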
    acme.sh also uses a more automatic approach. Most DNS providers will also expose an API to allow authenticated users to change DNS on their servers, using automated scripts (ISPConfig is not an exception! You can also use its built-in API to do that!). acme.sh includes a modular, plugin-based approach to handle the many different ways that providers expose such APIs, and deals with most of them in an absolutely transparent way; you just need to pass the extra parameter --dns dns_XXX where the XXX is the specific API plugin for your DNS provider (aye, there is one for ISPConfig, too!). If your particular provider isn't directly supported, you can write your own plugin, too — there are plenty of examples to help you out with that.
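To make the --dns dns_XXX parameter concrete, here is a minimal sketch that assembles such an acme.sh call as an argument list. The helper name is hypothetical, dns_cf (Cloudflare) is only an example plugin, and each plugin expects its own credentials in environment variables as described in the acme.sh documentation. Bear in mind the advice earlier in this thread against running acme.sh by hand on ISPConfig systems; the sketch only prints the command.

```python
import shlex


def build_issue_command(domains, dns_plugin, acme_sh="acme.sh"):
    """Build the argv for an acme.sh DNS-API issue request.

    domains: list of host names, primary domain first.
    dns_plugin: name of the acme.sh DNS plugin, e.g. "dns_cf".
    """
    if not domains:
        raise ValueError("at least one domain is required")
    if not dns_plugin.startswith("dns_"):
        raise ValueError("DNS plugin names start with 'dns_'")
    argv = [acme_sh, "--issue", "--dns", dns_plugin]
    for domain in domains:
        argv += ["-d", domain]
    return argv


if __name__ == "__main__":
    cmd = build_issue_command(["example.com", "www.example.com"], "dns_cf")
    # Print instead of executing; pass cmd to subprocess.run to really run it.
    print(" ".join(shlex.quote(part) for part in cmd))
```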
    Everything else remains the same, and the two methods (ACME-based or DNS-based) can be used interchangeably — they will respect the same directory layout, etc. In other words, you can manually revalidate the domain's certificates using the DNS approach, and ISPConfig will happily use whatever acme.sh has manually generated. Yay for that!
    You may ask why there even is an ACME-based approach, if DNS-based validation is so easily accomplished and automated (if an API is available, of course). The answer is simple: not everyone has full access to their DNS zone setup; and often you might have to administer web sites for which you absolutely require a certificate, but where you neither own the domain, nor have any access whatsoever to its DNS configuration. Many clients may be willing to point the IP address of their www.domain.tld to your server, but that's how far they will go. For security reasons, even if they have access to an API for their DNS provider, they will never share that key with you. And even while some APIs allow a very limited access to DNS using a specially-configured token for that purpose — enough to add a TXT record and little else — you can imagine that many corporate customers will not even then share such a token with you. Here's a good example: I have an address which actually comes from Dyn's Dynamic DNS (now an Oracle company); I don't 'own' the domain's DNS zone, nor have I any way to change it — except for setting an IP address dynamically. Thus, in such a scenario, if all I had was DNS-based validation, I couldn't run a HTTPS server on that address.
    ACME-based validation neatly side-steps that issue through Web-based validation with the .well-known approach. That way, you can get certificates from Let's Encrypt for websites that you don't 'own' but nevertheless have to host. When properly configured, ACME-based validation is actually much easier to set up than DNS-based validation, and it also happens 'instantly' (as opposed to DNS, which may take some time to propagate, even though these days this usually happens in seconds, not in days or weeks as was common two decades ago...).
    Therefore, ACME-based validation is always preferable to DNS-based... unless, of course, you get stuck in a catch-22 where you need a working web server to get proper certificate validation, but, because you lack that valid certificate, your website may not run, thus preventing ACME-based validation from succeeding!
    As a side note: there is another workaround when you cannot use ACME-based validation and do not 'own' the DNS domain zone. There is a way to get LE certificates using a 'proxy' DNS zone, a 'delegate', if you wish, of the 'real' domain. LE will look for the validation token on the 'delegate' instead of on the real DNS zone (the one that is not under your control). It was originally designed for those who want to use DNS validation but are not happy with making changes on the 'real' DNS zone (for whatever reason); they do all the changes on the proxy/delegate instead, which is assumed to be an easier-to-change DNS zone (in the sense that it's not used for any other purpose). It's even trickier to set up, but it does work well (and that's how I can use that Dynamic DNS-based address to run a web server with HTTPS using a valid certificate!).
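As far as I understand it, the delegation itself is a single CNAME: the _acme-challenge name of the 'real' domain points at the _acme-challenge name of the delegate zone, and all TXT changes then happen in the delegate (acme.sh calls this DNS alias mode and selects it with --challenge-alias). A minimal sketch of that record, with made-up domain names and a helper name of my own:

```python
def alias_cname_record(domain, alias_zone):
    """Return the zone-file line that delegates DNS-01 validation.

    domain: the 'real' domain the certificate is for.
    alias_zone: the delegate zone where TXT records can be changed.
    """
    if domain.startswith("*."):
        # Wildcards are validated at the base domain.
        domain = domain[2:]
    source = "_acme-challenge." + domain.rstrip(".")
    target = "_acme-challenge." + alias_zone.rstrip(".")
    return f"{source}. IN CNAME {target}."


if __name__ == "__main__":
    # Both names are placeholders.
    print(alias_cname_record("www.example.com", "validation.example.net"))
```

Note that this CNAME has to be created once in the 'real' zone, so somebody with access to it has to cooperate at least that one time.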
     
  4. So, what are the options here?

    I'm not quite sure how ISPConfig integrates with acme.sh, but it's reasonably straightforward to understand how the whole ACME setup is securely done (and that's why it's a bit complex and subject to many peculiar rules to us ISPConfig users!). All I could suggest, in this case, is to add some extra code to allow DNS as a fallback method to do the validation step. I still think that the ACME-based validation should, in all cases, be the default and main method of validation. But perhaps one could check if the ACME-based validation failed a number of times, and have a backoffice option to configure what DNS plugin should be used with acme.sh, as well as a way to fill in the necessary API parameters. This is sadly hard to do, because, as said, each DNS provider has their own API, and that means different methods of authenticating the requests. But, alas, I guess it would be do-able...

    Until then, in my personal case, like the OP, I'm stuck with manual DNS-based validation for all domains that fail...
     
  5. Aaah... by sheer coincidence I noticed just right now that my setup is probably not the best for using with HTTP-based ACME validation.
    I didn't know before reading this article, but now I learned that ACME definitely requires port 80 to be open and accessible.
    While the reasoning behind it is a bit murky (well, at least debatable...), the point is that there is no way to use HTTP-01 validation without having port 80 available — and a webserver configured to respond to unencrypted traffic behind it as well.
    This will exclude any configuration that is just slightly more complex than having your web server fully exposed to the Internet. Add a protection layer, a reverse proxy + cache, or a CDN, and poof, you're out of luck with HTTP-01 validation.
    So, for configurations just like mine, unless I start punching holes through all the security layers, there is no hope of ever getting it to work correctly. All I can do is to rely on DNS-01 validation — or, optionally, TLS-ALPN-01 validation, which is a 'new' way to accomplish essentially the same thing using encrypted communications which uses a different port — thus avoiding any limitations placed on port 80. Needless to say that it means giving external access to a new port — even if its usage is strictly restricted.
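To check whether a given setup can pass HTTP-01 at all, it helps to know exactly what gets fetched: a plain-HTTP URL under /.well-known/acme-challenge/ on port 80. Here is a minimal sketch that builds that URL and probes whether a port accepts connections; the function names are mine, and the probe only means something when run from a machine outside your own network and firewall.

```python
import socket


def http01_challenge_url(domain, token):
    """URL that the CA requests during an HTTP-01 challenge."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def port_is_open(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered, timed out or unresolvable all count as closed.
        return False


if __name__ == "__main__":
    # Placeholder domain and token.
    print(http01_challenge_url("example.com", "some-token"))
    print("port 80 reachable:", port_is_open("example.com", 80))
```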
    I guess that DNS-based validation is still my best 'fallback' alternative.
    On the other hand, if I'm not using acme.sh for its intended purpose (e.g. automating the whole process of rewriting the nginx configuration), it might make more sense to rethink the whole approach rather than stick with it. As I have described previously, DNS-based validation does not require rewriting a single line of the web server configuration. The web server becomes irrelevant, since the whole process never needs a valid .well-known/acme-challenge directory to write its tokens to. As such, there might be better tools around than acme.sh. Since I'm a Go fangirl, I have found Lego very promising. It doesn't support ISPConfig's API straight out of the box, but I've already added a request to do so: https://github.com/go-acme/lego/issues/1604
     
  6. till

    till Super Moderator Staff Member ISPConfig Developer

  7. Thanks for the tip, @till!
    While it doesn't apply exactly to what I need, at least it gave me a clue on what to look for:
    1. acme.sh will 'remember' the last configuration you used, so if you switched away from web-based validation, this will be written (automatically) to the /etc/letsencrypt/your.domain.name/your.domain.name.conf file. Neat!
    2. ISPConfig will not interfere with whatever acme.sh thinks that it should use for validation. That's actually quite cool and wasn't obvious to me!
    3. You can force that the first request for a LE certificate is made via the DNS API, and that just requires a reasonably simple patch (especially if you're using ISPConfig to directly manage DNS; but it should work for other providers as well).
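To see which method acme.sh has 'remembered' for a domain, you can read that .conf file. Here is a minimal sketch, assuming the file consists of KEY='value' lines and that the validation method is stored under Le_Webroot (a dns or dns_XXX value for DNS validation, a path or another keyword otherwise); both points are my reading of acme.sh's files and should be checked against your own before relying on this.

```python
import re

# Assumption: acme.sh domain conf lines look like  Le_Webroot='dns_cf'
_CONF_LINE = re.compile(r"^(\w+)='(.*)'$")


def parse_domain_conf(text):
    """Parse KEY='value' lines into a dict, ignoring anything else."""
    conf = {}
    for line in text.splitlines():
        match = _CONF_LINE.match(line.strip())
        if match:
            conf[match.group(1)] = match.group(2)
    return conf


def uses_dns_validation(conf):
    """True if the remembered method looks like DNS validation."""
    # Assumption: Le_Webroot holds "dns" or "dns_<plugin>" in that case.
    return conf.get("Le_Webroot", "").startswith("dns")


if __name__ == "__main__":
    sample = "Le_Domain='example.com'\nLe_Webroot='dns_cf'\n"
    print(uses_dns_validation(parse_domain_conf(sample)))
```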

    In practice, as also reported elsewhere, this first request usually succeeds (in my case, the first request has succeeded every time). It just seems that subsequent requests will fail, so you'll have to remember to revalidate those domains using the DNS API, and, from then on, it should do so automatically.

    Well, that's the theory, at least. But now I believe I understand a bit better why some domains have no issues whatsoever (because they seem to be already configured to use the DNS validation!) while others are always a nightmare (because... well, I'm not sure why, to be honest!).
     
