RSpamd training?

Discussion in 'Server Operation' started by WhitcombeRD, Nov 2, 2022.

  1. WhitcombeRD

    WhitcombeRD Member

    Just installed a fresh ISPConfig 3.2 setup on Debian 11 using rspamd instead of amavis.

    I've spent years playing with SpamAssassin/amavis for junk filtering, so rspamd is totally new to me.

    One basic starter question: previously I had an sa-learn script that ran from a cron job over all users' Junk folders to learn spam that got through.
    Users just moved a false-negative spam into Junk, and the script picked it up on its next run and updated the filter.

    How do I do something similar with rspamd in the ISPConfig setup? I can see the manual training in the web interface, but how can I get it to auto-learn from each user's Junk folder on a periodic basis?

    Do I need extra plugins for this, or does ISPConfig come with a mechanism? Forum search isn't yielding any results for me, sadly.
     
  2. pyte

    pyte Well-Known Member HowtoForge Supporter

    You can train rspamd with spam too:
    Code:
    rspamc learn_spam /var/vmail/domain.tld/user/Maildir/.INBOX.Junk/cur
    But rspamd does this by default while scanning anyway. What you are looking for is classifier-bayes, which should be located in /etc/rspamd/local.d/classifier-bayes.conf. It has an autolearn option for spam, junk and ham with the corresponding trigger values. A default ISPConfig installation should have it set up like this:
    Code:
    backend = "redis";
    servers = "127.0.0.1";
    autolearn {
      spam_threshold = 6.0; # When to learn spam (score >= threshold and action is reject)
      junk_threshold = 4.0; # When to learn spam (score >= threshold and action is rewrite subject or add header, and has two or more positive results)
      ham_threshold = -0.5; # When to learn ham (score <= threshold and action is no action, and score is negative or has three or more negative results)
      check_balance = true; # Check spam and ham balance
      min_balance = 0.9; # Keep diff for spam/ham learns for at least this value
    }
    per_user = false;
    per_language = true;
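
The inline comments in the autolearn block above can be read as a small decision rule. The following is only a sketch of that reading, not rspamd's actual implementation: the function name is made up for illustration, and the extra conditions on the number of positive or negative results are deliberately left out.

```python
def autolearn_class(score, action,
                    spam_threshold=6.0, junk_threshold=4.0, ham_threshold=-0.5):
    """Return "spam", "ham" or None, following the comments in the autolearn block.

    Simplification (assumption): the conditions on the number of positive
    or negative results mentioned in the comments are ignored here.
    """
    # spam_threshold applies when the action is reject
    if action == "reject" and score >= spam_threshold:
        return "spam"
    # junk_threshold applies when the action is rewrite subject or add header
    if action in ("rewrite subject", "add header") and score >= junk_threshold:
        return "spam"
    # ham_threshold applies when the action is no action
    if action == "no action" and score <= ham_threshold:
        return "ham"
    # everything in between is not auto-learned at all
    return None
```

Under this reading, a message that scores 5.0 and is delivered with "no action" is learned as neither spam nor ham.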
     
  3. WhitcombeRD

    WhitcombeRD Member

    The top line looks to be what I need, thanks.

    Spam is getting through, so it can't "learn" that without being told to.

    So I guess the old system of getting users to put missed spam in the Junk folder and running learn_spam from a crontab will do the same job as before.
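
A periodic job along these lines could be sketched as follows. This is a minimal sketch under assumptions: the /var/vmail/<domain>/<user>/Maildir layout and the .INBOX.Junk folder name are taken from the command quoted earlier in the thread and may differ on a given install, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def find_junk_dirs(vmail_root, junk_folder=".INBOX.Junk"):
    """Return every <domain>/<user>/Maildir/<junk_folder>/cur directory under vmail_root."""
    # Assumed layout: <vmail_root>/<domain>/<user>/Maildir/<junk_folder>/cur
    pattern = f"*/*/Maildir/{junk_folder}/cur"
    return sorted(p for p in Path(vmail_root).glob(pattern) if p.is_dir())


def learn_spam(junk_dirs, rspamc="rspamc", dry_run=False):
    """Run `rspamc learn_spam <dir>` for each directory and return the commands."""
    commands = [[rspamc, "learn_spam", str(d)] for d in junk_dirs]
    if not dry_run:
        for cmd in commands:
            # check=False: one failing mailbox should not abort the whole run
            subprocess.run(cmd, check=False)
    return commands
```

Calling learn_spam(find_junk_dirs("/var/vmail")) from a script started by cron would then process every mailbox; passing dry_run=True only returns the commands without running them.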
     
  4. pyte

    pyte Well-Known Member HowtoForge Supporter

    This is not correct. As you see in the config above
    Code:
    spam_threshold = 6.0; # When to learn spam
    It will learn the message as spam when the score is 6.0 or higher; however, your levels for add header or reject are likely higher than this. For example, if add header is 7.0 and reject is 11.0, the config will still learn messages as spam even if they are not rejected or flagged as such.
    It should work this way. You may want to check the official documentation for further information and ways to automate the process.
     
  5. WhitcombeRD

    WhitcombeRD Member

    I've had a few through scoring 4 and 5, so they aren't going to get learnt without manual intervention, hence wanting it to learn from a folder of those it misses.
     
  6. pyte

    pyte Well-Known Member HowtoForge Supporter

    You may want to analyse these mails and see if you can fine-tune your rspamd scoring instead of taking the learning approach here. rspamd is quite beefy when it comes to tackling spam, and you can fine-tune a lot of things, so don't treat it like amavis.

    I manage a mail server handling 23 million mails a year and use rspamd. I did a lot of fine-tuning by analysing spam mails and seeing which symbols applied to them. For example, we had a lot of spam that was mostly fine when it came to technical checks, but one thing that was off with all of them was a missing To field. I can't think of a case where I should accept a mail with a missing To field as valid, so I've set the "MISSING_TO" symbol to apply a score of +7. This should be the way you tackle spam, not blindly training it with what users think is spam.
    I don't know how big your setup is; in my case we have a lot of users who send us mails with attached .eml files of what they describe as spam. A lot of these mails are totally valid mails, in most cases newsletters they subscribed to. Sure, the mail may be unwanted by the user, but that doesn't classify it as spam.
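
To illustrate the MISSING_TO adjustment described above, here is a rough sketch that treats a message's score as the plain sum of the weights of the symbols that fired. That is a simplification of how rspamd scores mail, and every number except the +7 for MISSING_TO is hypothetical, as is the second symbol name.

```python
def total_score(fired_symbols, weights):
    """Sum the configured weight of every symbol that fired on a message."""
    return sum(weights[name] for name in fired_symbols)


# Hypothetical weights: only the MISSING_TO value of 7.0 comes from the post above.
before = {"MISSING_TO": 2.0, "HYPOTHETICAL_MINOR_RULE": 0.5}
after = dict(before, MISSING_TO=7.0)

fired = ["MISSING_TO", "HYPOTHETICAL_MINOR_RULE"]
print(total_score(fired, before))  # 2.5
print(total_score(fired, after))   # 7.5
```

With the higher weight, a message that is otherwise clean ends up much closer to whatever tag or reject levels are set in the policy.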
     
  7. WhitcombeRD

    WhitcombeRD Member

    Reviving this thread due to a new issue.

    The above learn command worked perfectly for weeks, but suddenly last week it stopped working.

    Every single mail sent to learn results in:
    Nothing has changed on my end in configuration or system setup at all. A Google search didn't yield any really useful information as to a cause.

    Any ideas what's happened?
     
  8. pyte

    pyte Well-Known Member HowtoForge Supporter

    Yes. rspamd reached its learning target of 200 tokens. No need to worry; when you activate rspamd's debug logging, the error is logged as follows:
    Code:
    rspamd_stat_classifier_is_skipped: learn condition for classifier bayes returned: already in class spam; probability 100.00%; skip classifier
    The error you get sounds worrying, but yeah, nothing to worry about on your side, just bad wording from the rspamd devs I guess :)
     
  9. WhitcombeRD

    WhitcombeRD Member

    Makes sense after a few weeks of learning the spam it missed.
    Does that mean it's not learning any of the new (and increasing) spam I'm trying to train it on, though?
    Ultimately I'm training it on messages that were all falsely classed as ham on arrival.
     
  10. pyte

    pyte Well-Known Member HowtoForge Supporter

    It just means that the message is already classified as either ham or spam, so it skips learning it.
     
  11. WhitcombeRD

    WhitcombeRD Member

    OK... so next question: if it's already classified it as ham, how do I reclassify it as spam?
    I'm assuming there's a better method than feeding it manually through the web interface?
     
  12. pyte

    pyte Well-Known Member HowtoForge Supporter

    I am afraid this is not possible for the time being. However, a feature request is open and can be found here: https://github.com/rspamd/rspamd/issues/3600

    //EDIT: From the feature request, it seems to be possible as of now with:
    Code:
    rspamc learn_spam --header 'Learn-Type: bulk' 'MESSAGEID'
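
The skip behaviour discussed in this thread, and the effect of the Learn-Type: bulk header, can be pictured roughly like this. It is a sketch of one reading of the log message and the feature request, not rspamd's own code: the function name and the 0.95 cutoff are assumptions made for illustration.

```python
def can_learn(target_class, bayes_spam_probability, learn_type=None, cutoff=0.95):
    """Return (allowed, reason) for a manual learn request.

    target_class: "spam" or "ham".
    bayes_spam_probability: the classifier's current spam probability, 0.0 to 1.0.
    cutoff: ASSUMED confidence above which a message counts as already learned.
    """
    if learn_type == "bulk":
        # A 'Learn-Type: bulk' request header bypasses the check entirely.
        return True, "bulk learn, check skipped"
    if target_class == "spam" and bayes_spam_probability >= cutoff:
        return False, "already in class spam"
    if target_class == "ham" and bayes_spam_probability <= 1.0 - cutoff:
        return False, "already in class ham"
    return True, "ok"
```

For example, can_learn("spam", 1.0) returns (False, "already in class spam"), which matches the debug log line quoted earlier, while the same call with learn_type="bulk" is allowed.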
     
  13. WhitcombeRD

    WhitcombeRD Member

    Cheers, I'll give it a go.
    Failing that, I'll go back to SpamAssassin, which can be trained, as the volume of spam getting through, all with a lowish score (5.9 or so), is increasing by the day.
     
  14. till

    till Super Moderator Staff Member ISPConfig Developer

    A score of 5.9 is quite high and not low. Maybe you missed adjusting the spam score in the policies when switching from amavis to rspamd? Rspamd in general is way more accurate than amavis.
     
  15. WhitcombeRD

    WhitcombeRD Member

    It's the default policy on a fresh install (I didn't migrate; it's a clean ISPConfig install).
    The default spam threshold is set to 8. Quite a few hams are arriving in the 5-7 range as well.

    This worked very well for about a month, with a weekly script learning the few that were missed. Lately the learning has failed (as above), and there is also a huge increase in spam that previously scored higher now getting through.
     
  16. till

    till Super Moderator Staff Member ISPConfig Developer

    The learning did not fail. What Rspamd told you is that it knows that specific token is already in the spam category, so there is no need and no benefit to learning it again as spam. Manual training can also cause things to go wrong. That's why I do not do that on my systems. Rspamd learns on its own and considers that the threshold between ham and spam is significant enough for that.

    And the threshold for spam is different for each system, so it needs to be adjusted by the system administrator after installation to match his needs.
     
  17. WhitcombeRD

    WhitcombeRD Member

    Having the tokens in the category really doesn't seem to work though given the same messages arrive again and again without the score changing (and they're flagged as ham every time).
     
  18. pyte

    pyte Well-Known Member HowtoForge Supporter

    I second what till said: rspamd is way better at dealing with spam than SpamAssassin/amavis, and I've used both on mail servers handling up to 22 million mails a year.
    I don't see the need to learn spam/ham manually, and I don't understand what you are trying to gain here. If you want to tweak settings, check why mails that you consider "spam" get accepted, and adjust the symbols in question to better fit your needs.
     
  19. WhitcombeRD

    WhitcombeRD Member

    What I'm trying to achieve is at least a similar level of spam detection accuracy to what I had on the old SpamAssassin, plus the ability to get the thing to actually learn from the (significant) amount of spam it fails to flag.
    With SA, for the few it missed, I'd just have a cron job training it; the mails then usually got flagged and recognition improved.

    [Image: screenshot of the mailbox described below]

    For example, this is from one mailbox, about 24 hours' worth. You can see it's flagged one as spam (the top one) with a score of 9.79.
    All of the others have the same score (5.39).
    In the actual web interface history all I see is soft-rewrite and greylist. The tokens look like this:
     
  20. till

    till Super Moderator Staff Member ISPConfig Developer

    Okay, so you can see from there that Rspamd correctly recognized it as spam. The Bayes filter gives a score of up to 5.1 points, and the score here is "BAYES_SPAM (5.1) [100.00%]", so the self-learning filter scored at 100%. That confirms what I posted above: the filter has already learned this as spam and gives it a 100% spam score. No other major rules were triggered by this email, so the self-learning filter score is the most relevant part of the overall score. What happens in the end with a given score is up to you and the settings you choose in the policy, as mentioned in my post above.
     