This is a perfect server Debian Jessie system running ISPCFG3 + nginx with the usual postfix dovecote amavisd-new spamassassin setup After an upgrade from mysql 5.6 to 5.7 which went slightly wrong but seems fixed now I am now experiencing timeouts with amavis meaning every email is being deferred for minutes up to hours. I'd be grateful for any pointers. Funny enough, nothing was changed when it comes to the mailing system so it must be related to the mysql upgrade. I have changed the log level for amavis to 5 and couldn't spot any errors so its down to 4 now but still nothing jumps out at me. Code: message deferral detail ----------------------- error (total: 110) 86 lost connection with 127.0.0.1[127.0.0.1] while sending end of... 11 lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO 9 conversation with 127.0.0.1[127.0.0.1] timed out while sending... 2 address resolver failure 2 conversation with 127.0.0.1[127.0.0.1] timed out while receivi... smtp (total: 590) 349 lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO 171 lost connection with 127.0.0.1[127.0.0.1] while sending end of... 33 Host not found, try again 14 conversation with 127.0.0.1[127.0.0.1] timed out while sending... 9 conversation with 127.0.0.1[127.0.0.1] timed out while receivi... Syslog also notes some of these DB issues but I have tested the user/password for dbispconfig inside that file and they are ok. This coudl explain the deferred emails but how to fix or debug? Code: Apr 13 07:43:00 alfred postfix/smtp[30110]: warning: mysql:/etc/postfix/mysql-virtual_relaydomains.cf: table lookup problem And this is what I did when upgrading mysql: Code: dpkg status changes Installed: libio-socket-ip-perl:all 0.32-1 libmecab2:amd64 0.996-1.1 mysql-server:amd64 5.7.12-1debian8 mysqltuner:all 1.3.0+git20140812-1 Reinstalled: mysql-apt-config:all 0.7.0-1 Upgraded: libmysqlclient18:amd64 5.6.29-1debian8 => 5.6.30-1debian8 libmysqlclient18:amd64 5.6.30-1debian8 => 5.7.5-m15-2debian8 linux-libc-dev:amd64 3.16.7-ckt25-1 => 3.16.7-ckt25-2 mysql-apt-config:all 0.5.3-1 => 0.7.0-1 mysql-client:amd64 5.6.29-1debian8 => 5.6.30-1debian8 mysql-client:amd64 5.6.30-1debian8 => 5.7.12-1debian8 mysql-common:amd64 5.6.29-1debian8 => 5.6.30-1debian8 mysql-common:amd64 5.6.30-1debian8 => 5.7.12-1debian8 mysql-community-client:amd64 5.6.29-1debian8 => 5.6.30-1debian8 mysql-community-client:amd64 5.6.30-1debian8 => 5.7.12-1debian8 mysql-community-server:amd64 5.6.29-1debian8 => 5.6.30-1debian8 mysql-community-server:amd64 5.6.29-1debian8 => 5.7.12-1debian8 mysql-server:amd64 5.6.29-1debian8 => 5.6.30-1debian8 Removed: mysql-community-server:amd64 5.6.29-1debian8 mysql-server:amd64 5.6.30-1debian8
Please try the mysql settings that I mentioned in your other thread. without that setting (or with strict_trans_tables), ispconfig 3.0.5. will fail as well with mysql 5.7.
You lost me a bit with " strict_trans_tables" - I googled and apparently its default with mysql starting 5.7.5. I run MySQL on 127.0.0.1 (5.7.12) Any more advice how and where to check this? ###edit### double checked this is in my my.cnf Code: [mysqld] # added after upgrade to mysql 5.7 sql_mode=NO_ENGINE_SUBSTITUTION so no strict_trans_tables there.
I did not had this issue yet and I switched all newer servers to mariadb, so I dont use MySQL 5.7. Maybe the new MySQL version has too short timeouts, did you try to set longer timeouts? Another possibility is that this is a false error or in other words, one that dont matter. Do you have actaul mail problems or just log messages about timeouts, it might be that amavis tries to keep the connection open but mysql closes it after a longer idle period, in this case, amavis should create a new connection automatically, but you may see such errors in the log without a real issue behind them.
hm, I'll look into the timeouts, thanks but the server isn't really busy at all... shall I look at mysql timeouts (current settings) wait_timeout = 60 interactive_timeout = 180 or amavis timeouts? I found these 2 so far: $child_timeout $smtpd_timeout THERE ARE REAL issues as every incoming and outgoing email is deferred for a while. I've been restarting amavis and doing postqueue -f regularly today, otherwise emails got stuck for minutes up to hours.
The timeout problem is an issue on servers with low load (it happens when no new mail comes in within the specified timeout period), so when the timeout is e.g. 1 minute and you don't receive a message every minute, then you might get such an error. At a higher load, the connection will not time out as it don't gets idle too long. The timeouts are in MySQL, not amavis. Try to increase both to e.g. 5 minutes and check if this changes the behaviour or if you get less timeouts.
Ah, I understand. I have now changed the 2 mysql timeouts to 5 minutes as advised and will observe and report back here.
ok, so those timeouts didn't help at all. additionally I noticed that inside /etc/postfix/master.cf I had Code: amavis unix - - - - 2 smtp -o smtp_data_done_timeout=2400 -o smtp_send_xforward_command=yes but inside /etc/amavis/conf.d/50-user was Code: $max_servers = 20; and I read that those must match so I adjusted it to Code: $max_servers = 2; and yet I see mails being deferred, here is one example: Code: root@alfred:/# cat /var/log/mail.log | grep 40BE9520D3A Apr 13 17:13:03 alfred postfix/smtpd[26559]: 40BE9520D3A: client=za-smtp-delivery-153.mimecast.co.za[41.74.201.153] Apr 13 17:13:03 alfred postfix/cleanup[26677]: 40BE9520D3A: message-id=<[email protected]> Apr 13 17:13:04 alfred postfix/qmgr[8500]: 40BE9520D3A: from=<[email protected]>, size=20857, nrcpt=1 (queue active) Apr 13 17:13:04 alfred postfix/smtp[26683]: 40BE9520D3A: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10024, delay=1.5, delays=1.5/0/0/0.01, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO) Apr 13 17:23:00 alfred postfix/qmgr[8500]: 40BE9520D3A: from=<[email protected]>, size=20857, nrcpt=1 (queue active) Apr 13 17:23:00 alfred postfix/smtp[29541]: 40BE9520D3A: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10024, delay=598, delays=598/0/0/0.01, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO) Apr 13 17:33:00 alfred postfix/qmgr[8500]: 40BE9520D3A: from=<[email protected]>, size=20857, nrcpt=1 (queue active) Apr 13 17:33:00 alfred postfix/smtp[32666]: 40BE9520D3A: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10024, delay=1197, delays=1197/0/0/0.01, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO) Any more advice? ###edit### Checking for listening amavis processes, shouldn't they be "listening" instead of close_wait? Code: root@alfred:~# netstat -a | grep 10024 tcp 0 0 localhost.localdo:10024 *:* LISTEN tcp 1 0 localhost.localdo:10024 localhost.localdo:39738 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:43624 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:43344 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:34184 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:38856 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:59782 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:35630 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:50686 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:58866 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:53602 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:32970 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:44712 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:59774 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:36006 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:36554 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:58864 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:48692 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:39708 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:39054 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:38282 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:50688 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:33054 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:34992 CLOSE_WAIT tcp 16740 0 localhost.localdo:10024 localhost.localdo:41910 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:39058 CLOSE_WAIT tcp 1 0 localhost.localdo:10024 localhost.localdo:38272 CLOSE_WAIT tcp6 0 0 localhost:10024 [::]:* LISTEN
Sorry to insist but I'm stuck here. Any more ideas? Shall I up the timeouts of MYSQl even more? Will this not negatively influence its performance? (I don't really care about RAM right now, the machine has plenty and handles very low email/web volumes)
FIXED. I got this under control by slowly increasing the wait times for mysql until I hit the sweet spot with these settings: Code: wait_timeout = 3600 interactive_timeout = 3600 Next I'll probably slowly decrease the interactive_timeout as an amavis connection will definitely not take that long I don't think this will be any problem in the future as this is a very low-volume mail and web server. Any further input would be appreciated.
The problem is back, I've been restarting amavis once every half an hour and then doing a postqueue -f the whole day. Any more pointers are appreciated. This problem only surfaced today, not aware of anythign that could have triggered it :-(
as a last resort I found this thread: http://www.iredmail.org/forum/post41574.html#p41574 recommending: Code: $max_servers = 1; $max_requests = 1; lets see how that goes.
I'm still having this problem, especially Monday morning as there are hardly any emails coming through on the weekends so I have a cron job restarting amavis and flushing the postfix queue Monday morning. But if there is a public holiday I have to manually do this :-/ So I was thinking, could this be "monitored" with monit? If it can, what should I monitor, amavisd or mysql? Any help with a monitor setup to keep this in check?
so I implemented checking of amavis with monit: Code: check process amavisd with pidfile /var/run/amavis/amavisd.pid group mail start program = "/usr/sbin/service amavis start" stop program = "/usr/sbin/service amavis stop" if failed port 10024 protocol smtp then exec /root/restart_mailserver.sh if 5 restarts within 25 cycles then timeout then I found a slightly more expanded version to do this here: https://www.howtoforge.com/communit...guration-for-debian-ispconfig-3-server.43177/ but I'm not sure if that is needed? And yet, this morning, I had many, many emails in my deferred queue so obviously this is not working :-(
embarrassingly enough, I totally panicked and simply restarted amavis and postfix so I cannot answer your question. I had put the solution with monit monitoring into place about 2 weeks ago and have since received emails from monit when amavis was not responding and was restarted by monit. It seemed all was working for 2 weeks. I'm at a total loss with this issue. I fear my email hosting customers will soon walk. Do you guys do paid issue solving? Or do you have any recommendations? Or more ideas for solving this myself?
and on top of this, after restarting amavis and flushing the postfix queue the email from my biggest client are still stuck in the deferred queue: T 5 10 20 40 80 160 320 640 1280 1280+ TOTAL 17 0 0 0 0 3 8 3 3 0 0 All other mails were sent out. All new incoming and outgoing emails are working now except these which are still stuck in the queue. All of them are saying: Code: (delivery temporarily suspended: lost connection with 127.0.0.1[127.0.0.1] while sending end of data -- message may be sent more than once) ###edit### using postqueue -i xxxxxxx I was able to send out most of these for now.
@till: got the exact same problem right now. Stuff in the deferred queue. Amavis running. Even did a manual amavis restart. It seems the problem is between monit and amavis. also monit monitoring only port 10024 and not 10026. The log files now actually point to a problem with amavis on port 10026 not 10024: Code: [LIST] [*]Sep 10 12:01:43 alfred postfix/smtpd: 1FDEC52076E: client=jacob.ict-consult.co.za[51.254.252.92] [*]Sep 10 12:01:43 alfred postfix/cleanup: 1FDEC52076E: message-id=<[email protected]> [*]Sep 10 12:01:43 alfred postfix/qmgr: 1FDEC52076E: from=<whatsnew-bounces+denis=localrecipientdomain.co.za@list.gcis.gov.za>, size=14113, nrcpt=1 (queue active) [*]Sep 10 12:01:43 alfred postfix/smtp: 1FDEC52076E: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10026, delay=0.09, delays=0.05/0.03/0/0.01, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO) [*]Sep 10 12:04:51 alfred postfix/qmgr: 1FDEC52076E: from=<whatsnew-bounces+denis=localrecipientdomain.co.za@list.gcis.gov.za>, size=14113, nrcpt=1 (queue active) [*]Sep 10 12:06:52 alfred postfix/smtp: 1FDEC52076E: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10026, delay=309, delays=189/0/0/120, dsn=4.4.2, status=deferred (conversation with 127.0.0.1[127.0.0.1] timed out while sending DATA command) [*]Sep 10 12:14:02 alfred postfix/qmgr: 1FDEC52076E: from=<whatsnew-bounces+denis=localrecipientdomain.co.za@list.gcis.gov.za>, size=14113, nrcpt=1 (queue active) [*]Sep 10 12:17:07 alfred postfix/error: 1FDEC52076E: to=<[email protected]>, relay=none, delay=924, delays=739/185/0/0, dsn=4.4.2, status=deferred (delivery temporarily suspended: conversation with 127.0.0.1[127.0.0.1] timed out while receiving the initial server greeting) [*]Sep 10 12:21:29 alfred postfix/qmgr: 1FDEC52076E: from=<whatsnew-bounces+denis=localrecipientdomain.co.za@list.gcis.gov.za>, size=14113, nrcpt=1 (queue active) [/LIST] ###edit### And only a few minutes after restarting amavis, I realize the deferred queue might be empty but the active one is now full and stuck apparently. Code: Sep 10 12:33:53 alfred postfix/pickup: E59F852097B: uid=0 from=<root> Sep 10 12:33:53 alfred postfix/cleanup: E59F852097B: message-id=<[email protected]> Sep 10 12:33:53 alfred postfix/qmgr: E59F852097B: from=<[email protected]>, size=31473, nrcpt=1 (queue active) Sep 10 12:38:53 alfred postfix/smtp: E59F852097B: to=<[email protected]>, orig_to=<root>, relay=127.0.0.1[127.0.0.1]:10024, delay=300, delays=0.02/0/300/0, dsn=4.4.2, status=deferred (conversation with 127.0.0.1[127.0.0.1] timed out while receiving the initial server greeting) It looks like I have an amavis problem now: Code: root@alfred:~# /etc/init.d/amavis restart Stopping amavisd: (not running). Starting amavisd: amavisd-new. root@alfred:~# /etc/init.d/amavis status [FAIL] amavisd is not running ... failed! monit keeps saying: amavisd Execution failed I tried both: start program = "/usr/sbin/service amavis start" and start program = "/etc/init.d/amavis start" with the same result. It seems stopping and starting monit helped. I enable monitoring of amavisd in monit, then I see amavisd Initializing for a while, then monit says: amavisd Execution failed and I realized there is no /var/run/amavis/amavisd.pid all I can find is a /var/run/amavis/amavisd.lock Any help in properly monitoring both ports of amavis and figuring out the PID location? Please also see this thread where I tried debugging amavis: https://www.howtoforge.com/community/threads/debugging-amavis.73925/ in the end I fixed thigns for now by restarting the server.