Training your users to properly train Bayes


Bayesian filtering is part of Kerio's email spam control. While it is important to understand that it is just one part, it's also important to understand the strengths and weaknesses inherent in this type of filtering.

By itself, Bayesian spam filtering would have a difficult task. It would know nothing about what is spam and what is not, and could only learn from what users tell it. Because it is just looking at words in an email, telling it that some particular email is spam isn't really very helpful on its own. Which word or phrase in the mail caused you to say that? The Bayes engine doesn't know. Only after it has seen numerous emails marked as spam can it start to guess that you don't want to see emails about male pattern baldness or refinancing your mortgage and so on. It takes hundreds of submissions to train the system, and in the meantime you get the spam.
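To make that concrete, here is a minimal sketch of a word-counting Bayes learner. It is a toy and not how SpamAssassin or Kerio actually compute scores; the class name, the smoothing, and the sample messages are all assumptions made for illustration. It shows that a single spam report leaves every word in the message looking equally guilty, and that only repeated examples let the interesting words stand out.

```python
from collections import Counter


class TokenBayes:
    """Toy per-word spam statistics; every name here is illustrative."""

    def __init__(self):
        self.spam_tokens = Counter()  # word -> number of spam messages containing it
        self.ham_tokens = Counter()   # word -> number of non-spam messages containing it
        self.spam_msgs = 0
        self.ham_msgs = 0

    def learn(self, text, is_spam):
        words = set(text.lower().split())  # count each word once per message
        if is_spam:
            self.spam_tokens.update(words)
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(words)
            self.ham_msgs += 1

    def spamminess(self, word):
        # Add-one smoothing keeps rarely seen words away from 0 or 1.
        p_spam = (self.spam_tokens[word] + 1) / (self.spam_msgs + 2)
        p_ham = (self.ham_tokens[word] + 1) / (self.ham_msgs + 2)
        return p_spam / (p_spam + p_ham)


bayes = TokenBayes()
bayes.learn("refinance your mortgage today", is_spam=True)
# After one report, "your" and "mortgage" are indistinguishable.
first_your = bayes.spamminess("your")
first_mortgage = bayes.spamminess("mortgage")

for _ in range(50):
    bayes.learn("refinance your mortgage now", is_spam=True)
    bayes.learn("your meeting notes attached", is_spam=False)

# With more examples, "your" settles near neutral and "mortgage" stands out.
later_your = bayes.spamminess("your")
later_mortgage = bayes.spamminess("mortgage")
print(first_your, first_mortgage, later_your, later_mortgage)
```

The non-spam examples matter as much as the spam ones here: without them, a common word like "your" would look just as spammy as "mortgage".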

As noted, however, it doesn't operate in a vacuum. The SpamAssassin filtering in Kerio already knows many things about spam and that helps train the Bayes engine also. Therefore, when reading what follows here, keep that firmly in mind: Bayesian filtering is not entirely dependent on user input.

User input ignored?

Sometimes it may seem to users that their input has NO effect on spam filtering. They mark something as spam again and again but it keeps showing up in their mailbox. That could be because training Bayes takes time - marking a dozen messages may seem sufficient to the user, but that may well be just statistical noise to the Bayes engine. It may need many more examples before it can understand what you are objecting to.

There is also the issue of a shared Bayesian database. Some systems maintain separate databases for each user, but Kerio (like most systems) uses one database that gloms all Bayesian knowledge together. There are advantages and disadvantages to both approaches. For one thing, taking all users together means faster training. If the grouped Bayes system has learned that "Buy VIA*GRA" is likely to be spam, it doesn't matter if you haven't seen such a message yet: when you do, it will be marked as spam. If the database were maintained on an individual basis (and assuming that SpamAssassin filtering did not help Bayes learn), you'd get that email because it has never been seen in your database.
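The difference between the two layouts can be sketched in a few lines. This is only an illustration under simplifying assumptions: the counters, the `report_spam` helper, the user names, and the scoring formula are all hypothetical and are not Kerio's storage format.

```python
from collections import Counter, defaultdict


def spam_score(spam_counts, ham_counts, token):
    """Smoothed share of a token's sightings that were spam; 0.5 means no evidence."""
    s = spam_counts[token]
    h = ham_counts[token]
    return (s + 1) / (s + h + 2)


# Shared layout: one pair of counters for the whole server.
shared_spam, shared_ham = Counter(), Counter()
# Per-user layout: one pair of counters per mailbox.
per_user_spam = defaultdict(Counter)
per_user_ham = defaultdict(Counter)


def report_spam(user, text):
    tokens = set(text.lower().split())
    shared_spam.update(tokens)
    per_user_spam[user].update(tokens)


# Only alice has ever seen (and reported) this kind of message.
for _ in range(20):
    report_spam("alice", "buy via*gra cheap")

# bob benefits from alice's reports only if the database is shared.
shared_view = spam_score(shared_spam, shared_ham, "via*gra")
bob_view = spam_score(per_user_spam["bob"], per_user_ham["bob"], "via*gra")
print(shared_view, bob_view)
```

In the shared layout the token is already strongly associated with spam when bob first receives it; in the per-user layout bob's database has no opinion at all.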

Bayes learns what ISN'T spam also. If you think "Christian Singles" is spam but the rest of your unmarried and fervently religious office does not, your marking these messages as spam may have no consequence. SpamAssassin comes into this also - it assigns negative spam points for things it thinks are definitely not spammish. With group-based Bayes, your opinion matters, but it may not matter very much.
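A little arithmetic shows how a single dissenting user gets outvoted. The figures below are invented for illustration (400 non-spam sightings from colleagues, a dozen spam reports from you), and the smoothed ratio is a stand-in for whatever the real engine computes.

```python
def spam_score(spam_sightings, ham_sightings):
    """Smoothed share of sightings that were spam; 0.5 would mean 'undecided'."""
    return (spam_sightings + 1) / (spam_sightings + ham_sightings + 2)


# Hypothetical numbers for a phrase your colleagues consider legitimate.
office_ham_sightings = 400  # times the phrase appeared in mail treated as not spam
my_spam_reports = 12        # times you marked such a message as spam

before = spam_score(0, office_ham_sightings)
after = spam_score(my_spam_reports, office_ham_sightings)
print(before, after)
```

Your dozen reports do move the score, but it stays far below the halfway point, so the phrase still counts in the message's favor.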

Bad training

It's not easy to train users to help out with spam filtering. Some users just don't want to bother with clicking the Spam button or dragging messages to the Junk Mail folder. Other users embrace the idea enthusiastically - sometimes too enthusiastically:

(from How URL Spam Filtering Beats Bayesian/Heuristics Hands Down)

.. approximately one-half of the emails reported by customers as "spam" are not spam. Airline mileage reports are routinely reported as spam, follow-up emails from well known companies, and much more are reported as spam. Since these spam reports are rejected by our trained staff, they do not affect our accuracy. Bayesian systems that accept these inaccurate spam reports without trained human review will be highly skewed.

(Note that URL filtering and URL blacklists are part of Kerio Connect spam filtering)

As explained at "I've been training the spam filter but I'm still receiving the same spam emails", marking something as spam that the system is sure is NOT spam (and vice versa) is unlikely to have much effect. Bayes is not just looking at the Subject line or the sender; it is looking at overall characteristics. There are other tools in Kerio Connect for getting rid of specific subjects and senders. Users need to understand that or they may be unnecessarily frustrated.

If enough users are marking non-spam as spam, they may inadvertently poison the Bayesian filters. This can cause other innocent email to be considered spam. As this poisoning can also come from real spammers, the cumulative damage can be bad enough to require a fresh start by resetting the entire database. See "Resetting the Bayes" in "Optimizing spam protection in Kerio MailServer".

Although that sounds draconian, a busy system will quickly re-learn your users' desires. Do be careful to explain to them that only true spam should be marked: if they have signed up for mailings from XYZ and now do not wish to receive those, the solution is to tell XYZ to stop sending mail, not to mark those emails as spam!
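Poisoning and resetting can be sketched with a toy shared database. Everything here is an assumption for illustration: the `SharedBayes` class, the averaged per-word score, and the sample messages. A real engine combines word probabilities differently, but the effect of bad reports is similar in spirit.

```python
from collections import Counter


class SharedBayes:
    """Toy shared database; not the actual Kerio or SpamAssassin implementation."""

    def __init__(self):
        self.reset()

    def reset(self):
        # A fresh start: forget everything learned so far.
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, text, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(set(text.lower().split()))

    def score(self, text):
        """Average smoothed spam share of the words in the message."""
        words = set(text.lower().split())
        if not words:
            return 0.5
        shares = [(self.spam[w] + 1) / (self.spam[w] + self.ham[w] + 2) for w in words]
        return sum(shares) / len(shares)


db = SharedBayes()
for _ in range(30):
    db.learn("your flight mileage statement is ready", is_spam=False)

innocent = "your mileage statement for march"
clean_score = db.score(innocent)

# Over-enthusiastic users report the airline's mail as spam, again and again.
for _ in range(200):
    db.learn("your flight mileage statement is ready", is_spam=True)
poisoned_score = db.score(innocent)

# Resetting wipes the bad training along with the good.
db.reset()
reset_score = db.score(innocent)
print(clean_score, poisoned_score, reset_score)
```

The innocent message goes from comfortably legitimate to probable spam purely because of the bad reports, and after the reset the database has no opinion until it is trained again.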

I hope this helps. Spam filtering is complicated and can be quite annoying. Understanding more about its inner workings can lessen confusion.


© Anthony Lawrence
