SOLVED: amavisd-new connections timing out

Discussion in 'Server Operation' started by Ovidiu, Apr 13, 2016.

  1. Ovidiu

    Ovidiu Active Member

    This is a perfect server Debian Jessie system running ISPCFG3 + nginx with the usual postfix dovecote amavisd-new spamassassin setup

    After an upgrade from mysql 5.6 to 5.7 which went slightly wrong but seems fixed now I am now experiencing timeouts with amavis meaning every email is being deferred for minutes up to hours. I'd be grateful for any pointers. Funny enough, nothing was changed when it comes to the mailing system so it must be related to the mysql upgrade.
    I have changed the log level for amavis to 5 and couldn't spot any errors so its down to 4 now but still nothing jumps out at me.

    Code:
    message deferral detail
    -----------------------
      error (total: 110)
            86   lost connection with 127.0.0.1[127.0.0.1] while sending end of...
            11   lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO
             9   conversation with 127.0.0.1[127.0.0.1] timed out while sending...
             2   address resolver failure
             2   conversation with 127.0.0.1[127.0.0.1] timed out while receivi...
      smtp (total: 590)
           349   lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO
           171   lost connection with 127.0.0.1[127.0.0.1] while sending end of...
            33   Host not found, try again
            14   conversation with 127.0.0.1[127.0.0.1] timed out while sending...
             9   conversation with 127.0.0.1[127.0.0.1] timed out while receivi...
    Syslog also notes some of these DB issues but I have tested the user/password for dbispconfig inside that file and they are ok. This coudl explain the deferred emails but how to fix or debug?
    Code:
    Apr 13 07:43:00 alfred postfix/smtp[30110]: warning: mysql:/etc/postfix/mysql-virtual_relaydomains.cf: table lookup problem
    
    And this is what I did when upgrading mysql:
    Code:
    dpkg status changes
    
    Installed:
      libio-socket-ip-perl:all 0.32-1
      libmecab2:amd64 0.996-1.1
      mysql-server:amd64 5.7.12-1debian8
      mysqltuner:all 1.3.0+git20140812-1
    Reinstalled:
      mysql-apt-config:all 0.7.0-1
    Upgraded:
      libmysqlclient18:amd64 5.6.29-1debian8 => 5.6.30-1debian8
      libmysqlclient18:amd64 5.6.30-1debian8 => 5.7.5-m15-2debian8
      linux-libc-dev:amd64 3.16.7-ckt25-1 => 3.16.7-ckt25-2
      mysql-apt-config:all 0.5.3-1 => 0.7.0-1
      mysql-client:amd64 5.6.29-1debian8 => 5.6.30-1debian8
      mysql-client:amd64 5.6.30-1debian8 => 5.7.12-1debian8
      mysql-common:amd64 5.6.29-1debian8 => 5.6.30-1debian8
      mysql-common:amd64 5.6.30-1debian8 => 5.7.12-1debian8
      mysql-community-client:amd64 5.6.29-1debian8 => 5.6.30-1debian8
      mysql-community-client:amd64 5.6.30-1debian8 => 5.7.12-1debian8
      mysql-community-server:amd64 5.6.29-1debian8 => 5.6.30-1debian8
      mysql-community-server:amd64 5.6.29-1debian8 => 5.7.12-1debian8
      mysql-server:amd64 5.6.29-1debian8 => 5.6.30-1debian8
    Removed:
      mysql-community-server:amd64 5.6.29-1debian8
      mysql-server:amd64 5.6.30-1debian8
     
    Last edited: Apr 13, 2016
  2. till

    till Super Moderator Staff Member ISPConfig Developer

    Please try the mysql settings that I mentioned in your other thread. without that setting (or with strict_trans_tables), ispconfig 3.0.5. will fail as well with mysql 5.7.
     
  3. Ovidiu

    Ovidiu Active Member

    You lost me a bit with " strict_trans_tables" - I googled and apparently its default with mysql starting 5.7.5. I run MySQL on 127.0.0.1 (5.7.12)

    Any more advice how and where to check this?
    ###edit###
    double checked this is in my my.cnf
    Code:
    [mysqld]
    # added after upgrade to mysql 5.7
    sql_mode=NO_ENGINE_SUBSTITUTION
    so no strict_trans_tables there.
     
    Last edited: Apr 13, 2016
  4. Ovidiu

    Ovidiu Active Member

    any other advice to sort out this amavis / mysql issue or at least properly debug it?
     
  5. till

    till Super Moderator Staff Member ISPConfig Developer

    I did not had this issue yet and I switched all newer servers to mariadb, so I dont use MySQL 5.7. Maybe the new MySQL version has too short timeouts, did you try to set longer timeouts? Another possibility is that this is a false error or in other words, one that dont matter. Do you have actaul mail problems or just log messages about timeouts, it might be that amavis tries to keep the connection open but mysql closes it after a longer idle period, in this case, amavis should create a new connection automatically, but you may see such errors in the log without a real issue behind them.
     
  6. Ovidiu

    Ovidiu Active Member

    hm, I'll look into the timeouts, thanks but the server isn't really busy at all...

    shall I look at mysql timeouts (current settings)
    wait_timeout = 60
    interactive_timeout = 180

    or amavis timeouts? I found these 2 so far:
    $child_timeout
    $smtpd_timeout


    THERE ARE REAL issues as every incoming and outgoing email is deferred for a while. I've been restarting amavis and doing postqueue -f regularly today, otherwise emails got stuck for minutes up to hours.
     
  7. till

    till Super Moderator Staff Member ISPConfig Developer

    The timeout problem is an issue on servers with low load (it happens when no new mail comes in within the specified timeout period), so when the timeout is e.g. 1 minute and you don't receive a message every minute, then you might get such an error. At a higher load, the connection will not time out as it don't gets idle too long.

    The timeouts are in MySQL, not amavis. Try to increase both to e.g. 5 minutes and check if this changes the behaviour or if you get less timeouts.
     
  8. Ovidiu

    Ovidiu Active Member

    Ah, I understand. I have now changed the 2 mysql timeouts to 5 minutes as advised and will observe and report back here.
     
  9. Ovidiu

    Ovidiu Active Member

    ok, so those timeouts didn't help at all.
    additionally I noticed that inside /etc/postfix/master.cf I had
    Code:
    amavis unix - - - - 2 smtp
            -o smtp_data_done_timeout=2400
            -o smtp_send_xforward_command=yes
    but inside /etc/amavis/conf.d/50-user was
    Code:
    $max_servers = 20;
    and I read that those must match so I adjusted it to
    Code:
    $max_servers = 2;
    and yet I see mails being deferred, here is one example:
    Code:
    root@alfred:/# cat /var/log/mail.log | grep 40BE9520D3A
    Apr 13 17:13:03 alfred postfix/smtpd[26559]: 40BE9520D3A: client=za-smtp-delivery-153.mimecast.co.za[41.74.201.153]
    Apr 13 17:13:03 alfred postfix/cleanup[26677]: 40BE9520D3A: message-id=<[email protected]>
    Apr 13 17:13:04 alfred postfix/qmgr[8500]: 40BE9520D3A: from=<[email protected]>, size=20857, nrcpt=1 (queue active)
    Apr 13 17:13:04 alfred postfix/smtp[26683]: 40BE9520D3A: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10024, delay=1.5, delays=1.5/0/0/0.01, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO)
    Apr 13 17:23:00 alfred postfix/qmgr[8500]: 40BE9520D3A: from=<[email protected]>, size=20857, nrcpt=1 (queue active)
    Apr 13 17:23:00 alfred postfix/smtp[29541]: 40BE9520D3A: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10024, delay=598, delays=598/0/0/0.01, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO)
    Apr 13 17:33:00 alfred postfix/qmgr[8500]: 40BE9520D3A: from=<[email protected]>, size=20857, nrcpt=1 (queue active)
    Apr 13 17:33:00 alfred postfix/smtp[32666]: 40BE9520D3A: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10024, delay=1197, delays=1197/0/0/0.01, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO)
    Any more advice?

    ###edit###
    Checking for listening amavis processes, shouldn't they be "listening" instead of close_wait?
    Code:
    root@alfred:~# netstat -a | grep 10024
    tcp        0      0 localhost.localdo:10024 *:*                     LISTEN
    tcp        1      0 localhost.localdo:10024 localhost.localdo:39738 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:43624 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:43344 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:34184 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:38856 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:59782 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:35630 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:50686 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:58866 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:53602 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:32970 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:44712 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:59774 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:36006 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:36554 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:58864 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:48692 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:39708 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:39054 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:38282 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:50688 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:33054 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:34992 CLOSE_WAIT
    tcp    16740      0 localhost.localdo:10024 localhost.localdo:41910 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:39058 CLOSE_WAIT
    tcp        1      0 localhost.localdo:10024 localhost.localdo:38272 CLOSE_WAIT
    tcp6       0      0 localhost:10024         [::]:*                  LISTEN
     
    Last edited: Apr 13, 2016
  10. Ovidiu

    Ovidiu Active Member

    Sorry to insist but I'm stuck here.
    Any more ideas?
    Shall I up the timeouts of MYSQl even more?
    Will this not negatively influence its performance? (I don't really care about RAM right now, the machine has plenty and handles very low email/web volumes)
     
  11. Ovidiu

    Ovidiu Active Member

    FIXED. I got this under control by slowly increasing the wait times for mysql until I hit the sweet spot with these settings:
    Code:
    wait_timeout = 3600
    interactive_timeout = 3600
    Next I'll probably slowly decrease the interactive_timeout as an amavis connection will definitely not take that long :)
    I don't think this will be any problem in the future as this is a very low-volume mail and web server.

    Any further input would be appreciated.
     
  12. Ovidiu

    Ovidiu Active Member

    The problem is back, I've been restarting amavis once every half an hour and then doing a postqueue -f the whole day.

    Any more pointers are appreciated. This problem only surfaced today, not aware of anythign that could have triggered it :-(
     
  13. Ovidiu

    Ovidiu Active Member

  14. till

    till Super Moderator Staff Member ISPConfig Developer

    ensure that you set the connection count in postfix master.cf to 1 as well.
     
  15. Ovidiu

    Ovidiu Active Member

    I'm still having this problem, especially Monday morning as there are hardly any emails coming through on the weekends so I have a cron job restarting amavis and flushing the postfix queue Monday morning. But if there is a public holiday I have to manually do this :-/

    So I was thinking, could this be "monitored" with monit?
    If it can, what should I monitor, amavisd or mysql?
    Any help with a monitor setup to keep this in check?
     
  16. Ovidiu

    Ovidiu Active Member

    so I implemented checking of amavis with monit:
    Code:
    check process amavisd with pidfile /var/run/amavis/amavisd.pid
    group mail
    start program = "/usr/sbin/service amavis start"
    stop program = "/usr/sbin/service amavis stop"
    if failed port 10024 protocol smtp then exec /root/restart_mailserver.sh
    if 5 restarts within 25 cycles then timeout
    then I found a slightly more expanded version to do this here: https://www.howtoforge.com/communit...guration-for-debian-ispconfig-3-server.43177/
    but I'm not sure if that is needed?

    And yet, this morning, I had many, many emails in my deferred queue so obviously this is not working :-(
     
  17. till

    till Super Moderator Staff Member ISPConfig Developer

    So amavis was not running again? Or was it running but stopped to respond on its ports?
     
  18. Ovidiu

    Ovidiu Active Member

    embarrassingly enough, I totally panicked and simply restarted amavis and postfix so I cannot answer your question.

    I had put the solution with monit monitoring into place about 2 weeks ago and have since received emails from monit when amavis was not responding and was restarted by monit. It seemed all was working for 2 weeks.

    I'm at a total loss with this issue. I fear my email hosting customers will soon walk. Do you guys do paid issue solving? Or do you have any recommendations?
    Or more ideas for solving this myself?
     
  19. Ovidiu

    Ovidiu Active Member

    and on top of this, after restarting amavis and flushing the postfix queue the email from my biggest client are still stuck in the deferred queue:
    T 5 10 20 40 80 160 320 640 1280 1280+
    TOTAL 17 0 0 0 0 3 8 3 3 0 0

    All other mails were sent out. All new incoming and outgoing emails are working now except these which are still stuck in the queue. All of them are saying:

    Code:
    (delivery temporarily suspended: lost connection with 127.0.0.1[127.0.0.1] while sending end of data -- message may be sent more than once)
    ###edit###
    using postqueue -i xxxxxxx I was able to send out most of these for now.
     
  20. Ovidiu

    Ovidiu Active Member

    @till:

    got the exact same problem right now. Stuff in the deferred queue. Amavis running. Even did a manual amavis restart. It seems the problem is between monit and amavis. also monit monitoring only port 10024 and not 10026.

    The log files now actually point to a problem with amavis on port 10026 not 10024:
    Code:
    [LIST]
    [*]Sep 10 12:01:43 alfred postfix/smtpd: 1FDEC52076E: client=jacob.ict-consult.co.za[51.254.252.92]
    [*]Sep 10 12:01:43 alfred postfix/cleanup: 1FDEC52076E: message-id=<[email protected]>
    [*]Sep 10 12:01:43 alfred postfix/qmgr: 1FDEC52076E: from=<whatsnew-bounces+denis=localrecipientdomain.co.za@list.gcis.gov.za>, size=14113, nrcpt=1 (queue active)
    [*]Sep 10 12:01:43 alfred postfix/smtp: 1FDEC52076E: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10026, delay=0.09, delays=0.05/0.03/0/0.01, dsn=4.4.2, status=deferred (lost connection with 127.0.0.1[127.0.0.1] while sending RCPT TO)
    [*]Sep 10 12:04:51 alfred postfix/qmgr: 1FDEC52076E: from=<whatsnew-bounces+denis=localrecipientdomain.co.za@list.gcis.gov.za>, size=14113, nrcpt=1 (queue active)
    [*]Sep 10 12:06:52 alfred postfix/smtp: 1FDEC52076E: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10026, delay=309, delays=189/0/0/120, dsn=4.4.2, status=deferred (conversation with 127.0.0.1[127.0.0.1] timed out while sending DATA command)
    [*]Sep 10 12:14:02 alfred postfix/qmgr: 1FDEC52076E: from=<whatsnew-bounces+denis=localrecipientdomain.co.za@list.gcis.gov.za>, size=14113, nrcpt=1 (queue active)
    [*]Sep 10 12:17:07 alfred postfix/error: 1FDEC52076E: to=<[email protected]>, relay=none, delay=924, delays=739/185/0/0, dsn=4.4.2, status=deferred (delivery temporarily suspended: conversation with 127.0.0.1[127.0.0.1] timed out while receiving the initial server greeting)
    [*]Sep 10 12:21:29 alfred postfix/qmgr: 1FDEC52076E: from=<whatsnew-bounces+denis=localrecipientdomain.co.za@list.gcis.gov.za>, size=14113, nrcpt=1 (queue active)
    [/LIST]
    
    ###edit###
    And only a few minutes after restarting amavis, I realize the deferred queue might be empty but the active one is now full and stuck apparently.
    Code:
    Sep 10 12:33:53 alfred postfix/pickup:  E59F852097B: uid=0 from=<root>
    Sep 10 12:33:53 alfred postfix/cleanup:  E59F852097B: message-id=<[email protected]>
    Sep 10 12:33:53 alfred postfix/qmgr:  E59F852097B: from=<[email protected]>, size=31473, nrcpt=1 (queue active)
    Sep 10 12:38:53 alfred postfix/smtp:  E59F852097B: to=<[email protected]>, orig_to=<root>, relay=127.0.0.1[127.0.0.1]:10024, delay=300, delays=0.02/0/300/0, dsn=4.4.2, status=deferred (conversation with 127.0.0.1[127.0.0.1] timed out while receiving the initial server greeting)
    It looks like I have an amavis problem now:
    Code:
    root@alfred:~# /etc/init.d/amavis restart
    Stopping amavisd: (not running).
    Starting amavisd: amavisd-new.
    root@alfred:~# /etc/init.d/amavis status
    [FAIL] amavisd is not running ... failed!
    monit keeps saying: amavisd Execution failed
    I tried both: start program = "/usr/sbin/service amavis start" and start program = "/etc/init.d/amavis start" with the same result. It seems stopping and starting monit helped. I enable monitoring of amavisd in monit, then I see amavisd Initializing for a while, then monit says: amavisd Execution failed and I realized there is no /var/run/amavis/amavisd.pid

    all I can find is a /var/run/amavis/amavisd.lock

    Any help in properly monitoring both ports of amavis and figuring out the PID location?
    Please also see this thread where I tried debugging amavis: https://www.howtoforge.com/community/threads/debugging-amavis.73925/
    in the end I fixed thigns for now by restarting the server.
     
    Last edited: Sep 10, 2016

Share This Page