Email Filtering - pragmatism vs accuracy

Background
We (Blacknight Solutions) have been offering email filtering to our clients since early 2002. We first began “experimenting” with spam filtering as we saw that the problem of spam/uce was growing exponentially and neither we nor our clients wanted to have our inboxes taken over by rubbish.
For the first 10-12 months after implementing server-side filtering we did not block email, as we preferred to merely tag it and deliver it. By tagging the subject line of emails in a consistent manner our clients were able to filter potential spam into another “folder” for examination.
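The tagging approach described above can be sketched in a few lines of Python. This is only an illustration and not our production setup: the tag text `{Spam?}` and the function name `tag_subject` are assumptions made for the example, and the decision about whether a message is suspect is taken as given.

```python
from email import message_from_string
from email.message import Message

TAG = "{Spam?}"  # assumed tag text; any marker works as long as it is consistent


def tag_subject(msg: Message, is_suspect: bool, tag: str = TAG) -> Message:
    """Prefix the Subject header with a fixed tag when the message looks like spam."""
    if not is_suspect:
        return msg
    subject = msg.get("Subject", "")
    if not subject.startswith(tag):  # avoid tagging the same message twice
        del msg["Subject"]  # assigning a header appends, so drop the old one first
        msg["Subject"] = f"{tag} {subject}".strip()
    return msg


raw = "From: a@example.com\nTo: b@example.com\nSubject: Cheap pills\n\nBuy now\n"
tagged = tag_subject(message_from_string(raw), is_suspect=True)
print(tagged["Subject"])  # {Spam?} Cheap pills
```

Because the tag is always the same string at the start of the subject, a single rule in any mail client is enough to move tagged mail into a separate folder, and the mail itself is still delivered.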
After our initial tagging period, which involved constant tweaking of the scoring criteria, we moved from tagging to storing.
Currently we offer email filtering at different levels to our clients. At the lower end of the scale the client’s email is scanned and stored by us without any user intervention, i.e. no customised black/white listing, while at the higher end customisable rules and criteria are implemented.
Scope and motivation of this article
Over the past 6 to 12 months the subject of email filtering has begun to attract more publicity both in “techie” circles and amongst the general public. One of the reasons for writing this article is to address some of the common misconceptions about email filtering and best practices. After following many of the discussions on technical mailing lists and bulletin boards over the last few months the author feels strongly that some people’s approach to email filtering is both misinformed and dangerous.
Due to the scope of the subject matter this article will probably be split into a number of shorter parts. Comments from readers are welcome.
This article will address some of the issues involved in implementing email filtering for business and discuss some of the methods currently being used both in industry in general and by the author.
Due to the nature of our service the finer details of our setup will not be revealed, but general criteria and methodology will be discussed.
Any opinions expressed in this article are the author’s and are based on the author’s experiences.
Definitions
In order to avoid confusion a number of terms should be defined for the purposes of this article.
UCE: unsolicited commercial email
For many people there is no clear difference between UCE and spam. However a number of things may give some indication. It helps if the sender makes it clear where they obtained the email address and how you may be removed from the list, although there is a very valid argument against unsubscribing from lists to which one never subscribed. Why should the onus be on the recipient? Unsubscribing also tells the sender that the email address is valid. In my case I can usually tell whether an email address has been scraped based purely on the address. A number of my older email aliases have not been used for at least two years due to the volume of spam that they were receiving. As a result I can safely say that any mail sent to info@ is spam, as the address has not appeared on our website for at least two years, nor have I used it in that period. This is not a matter of a spam trap but a simple case of applied logic: the only way you could get that address is through a spammer’s database.
Spam: If you look at the variety of definitions offered by Google for this term you should immediately see part of the problem. Depending on who you talk to, the scope of the definition can change quite dramatically. In simplest terms it may be best to refer to “spam” as unwanted commercial email, i.e. mail sent in bulk offering you commercial services that you do not want. Even that definition is not very clear, but it may help as a starting point. The type of spam that causes most problems for business is adult in nature and may vary from the extreme hardcore porn variety through to adverts for sexual aids, whether herbal, chemical or physical.
Tools
There is an ever-increasing number of tools and services on the market to help you block spam/UCE. These can be divided into two groups:
Client-side: The software resides on the user’s PC. It may be an independent piece of software or an add-on to an email client. For example, email clients such as Outlook 2003 and Eudora include spam filtering tools. Although client-side tools have their merits they do not address the primary issue with spam, which is the cost in both time and resources of downloading unwanted email. For this reason I believe that we should focus on server-side solutions. Another issue with client-side applications is that they are not updated often enough, so they cannot keep up with each new wave of spam.
Server-side: As the name suggests these are tools that work directly on the mail server. The advantages to using server-side tools are numerous. By blocking/filtering mail on the server you move the administrative responsibility away from the user to the server admin and their choice of tools. ISPs and hosting companies’ mail servers are connected to the ‘net 24/7 via high bandwidth connections, so although the level of unwanted email may incur a varying level of resource usage at the server level this will have significantly less impact than the resource usage at the client level.
Unlike client-side tools those used server-side have the ability to update not only in realtime but also through collaboration with other servers and through the usage patterns of the users being served.
Common Problems and misconceptions
There are a number of problems facing any provider of email filtering.

Technology
Client expectations
Accuracy
Contractual issues

Technology affects both the tools being used to stop the spam and those being used by the spammers to send it. Both are in constant evolution. A number of examples spring to mind:
Habeas: Users of the Habeas system are allowed to include a number of lines in their outgoing email which show that they are valid email users. A number of spam blocking tools, such as SpamAssassin, allocate a negative score to emails carrying these headers. Unfortunately spammers became aware of this “hole” and started using it as a way of getting mail past people’s filters. The only viable solution was to adjust the negative scoring assigned to the Habeas headers to compensate.
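To see why a large negative score is such an attractive target, consider the following sketch. The rule names, the individual scores and the threshold of 5.0 are all invented for illustration and are not taken from any real ruleset. It also assumes that "adjusting" means making the negative score smaller, which is one reading of the fix.

```python
from typing import Mapping

REQUIRED_SCORE = 5.0  # assumed threshold at which mail is treated as spam


def total_score(hits: Mapping[str, float]) -> float:
    """Add up the scores of every rule that matched the message."""
    return sum(hits.values())


def is_spam(hits: Mapping[str, float], required: float = REQUIRED_SCORE) -> bool:
    return total_score(hits) >= required


# Hypothetical rule hits for a spam message that forges a trusted-sender mark.
forged = {"PILL_ADVERT": 4.5, "FORGED_SENDER": 4.5, "TRUSTED_MARK": -8.0}
print(total_score(forged), is_spam(forged))      # 1.0 False: the spam gets through

# The same message after the trusted mark is given less weight.
adjusted = {**forged, "TRUSTED_MARK": -2.0}
print(total_score(adjusted), is_spam(adjusted))  # 7.0 True: the spam is caught
```

The trade-off is that a genuine user of the mark now gets less benefit from it, which is why the adjustment has to be balanced rather than simply removing the rule.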
Bayesian filters: Believed by many to be one of the most powerful weapons in the spam fighter’s arsenal, Bayesian filters score mail based on the frequency of words in ham and spam:
The Bayesian classifier in Spamassassin tries to identify spam by looking at what are called tokens; words or short character sequences that are commonly found in spam or ham. If I’ve handed 100 messages to sa-learn that have the phrase penis enlargement and told it that those are all spam, when the 101st message comes in with the words penis and enlargment, the Bayesian classifier will be pretty sure that the new message is spam and will increase the spam score of that message. (taken from http://wiki.apache.org/spamassassin/BayesInSpamAssassin)
Needless to say, spammers realised this and began to use a counter-technique, known as Bayes poisoning, to mess up the Bayes databases.
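The idea of token scoring, and of what poisoning does to it, can be shown with a toy classifier. This is a deliberately simplified sketch and is not how SpamAssassin implements its Bayes engine: the class name `TokenClassifier`, the tokenising rule, the add-one smoothing and the training messages are all assumptions made for the example.

```python
import math
import re
from collections import Counter


class TokenClassifier:
    """Toy per-token Bayes scorer; not SpamAssassin's actual implementation."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of ham messages containing it
        self.spam_msgs = 0
        self.ham_msgs = 0

    @staticmethod
    def tokenize(text):
        # Lower-cased words, de-duplicated so each token counts once per message.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def learn(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_msgs += 1

    def spam_probability(self, text):
        # Sum per-token log-odds with add-one smoothing, assuming equal priors.
        log_odds = 0.0
        for token in self.tokenize(text):
            p_spam = (self.spam_tokens[token] + 1) / (self.spam_msgs + 2)
            p_ham = (self.ham_tokens[token] + 1) / (self.ham_msgs + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))


clf = TokenClassifier()
for _ in range(100):
    clf.learn("cheap pills enlargement offer", is_spam=True)
    clf.learn("project meeting agenda attached", is_spam=False)

ham = "project meeting agenda"
print(clf.spam_probability("enlargement pills"))  # close to 1
before = clf.spam_probability(ham)                # close to 0

# Poisoning: spam padded with innocent words ends up being learned as spam.
for _ in range(100):
    clf.learn("enlargement pills project meeting agenda", is_spam=True)
after = clf.spam_probability(ham)
print(before < after)  # True: ordinary business words now look more spam-like
```

In this toy setup the legitimate message is still below 0.5 after poisoning, but its score has moved a long way in the wrong direction, and that drift is exactly what makes a poisoned database less trustworthy over time.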
Other issues that we have identified are simply related to the use of outdated tools. Older versions of SpamAssassin, for example, will falsely score email from Outlook 2003 as spam, as the Outlook version string in the header is not known to them. Although this issue is easily addressed via an upgrade to version 2.6* or later, many corporate users rely on 3rd party vendors for email filtering. If the 3rd party in question is negligent, email may be lost.
Client Expectations are, in many cases, one of the biggest issues.
Many clients expect a spam filter to be 100% effective. This is not possible and anybody who says otherwise is either naive or foolish.
No matter what technology you use to filter mail you will always have to balance the likelihood of getting false positives (ham marked as spam) against false negatives (spam marked as ham). If you wish to reduce the level of false negatives to zero you will get false positives. Why? There are a number of reasons for this, including badly formed mail, blacklisted netblocks and spam-like content, to name but a few. A more pragmatic approach is to tune the filter as far as you can while minimising the risk of false positives, accepting that some spam will get through. If you approach it from the other direction you run the risk of losing valid email.
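The trade-off can be made concrete by counting both kinds of error at different thresholds. The scores below are invented purely for illustration, and `count_errors` is a hypothetical helper rather than part of any filtering package.

```python
def count_errors(scored, threshold):
    """Return (false_positives, false_negatives) for a list of (score, is_spam)."""
    false_positives = sum(
        1 for score, is_spam in scored if score >= threshold and not is_spam
    )
    false_negatives = sum(
        1 for score, is_spam in scored if score < threshold and is_spam
    )
    return false_positives, false_negatives


# Invented sample: four legitimate messages, then four spam messages.
SAMPLE = [
    (0.5, False), (1.2, False), (3.8, False), (6.1, False),  # 6.1: badly formed ham
    (2.9, True), (5.5, True), (7.4, True), (12.0, True),
]

for threshold in (2.0, 5.0, 8.0):
    print(threshold, count_errors(SAMPLE, threshold))
# 2.0 (2, 0)  no spam gets through, but two valid emails are caught
# 5.0 (1, 1)
# 8.0 (0, 3)  no valid email is caught, but three spam messages get through
```

With this sample there is no threshold that brings both numbers to zero, which is the point: you have to choose which kind of error you can live with.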
The loss of valid email when you are filtering mail in a business environment is simply unacceptable. Although those in the industry are aware of the inherent unreliability of email as a communication method, end users expect it to simply work. As many businesses rely on email as one of their primary forms of communication with their suppliers and customer base, any delays or problems can have economic consequences.
Tools of the trade
MailScanner
MailScanner is an award-winning mail filtering/scanning package which is capable of complex scanning of email, both inbound and outbound. The default configuration is sane; however, we have extended and customised ours extensively, allowing us to implement email filtering to our taste. MailScanner makes use of the SpamAssassin libraries but it does NOT use spamc or spamd.
SpamAssassin
The current version is 2.64, although most people are still using 2.63 as 2.64 was only released a couple of days ago. Version 3 is currently in development and should be released within the next few weeks. The 3 series will bring a number of radical changes and improvements to the engine, but as it is still at the beta / release candidate stage it is not being used on many production systems.
The documentation available on the SA website is comprehensive and covers installation and configuration for both server-wide and per user installation.
Blacklists (RBLs)
These are variously referred to as blacklists, RBLs, realtime blacklists or DNS blacklists. They can be used directly at the MTA level or via MailScanner, SpamAssassin or similar.
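For readers who have not looked at how a DNS blacklist is queried, the usual convention for IPv4 is to reverse the octets of the connecting IP address, append the blacklist's zone and do an ordinary DNS lookup: an answer means the address is listed. The sketch below assumes a made-up zone, `dnsbl.example.org`, and the function names are hypothetical.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up, e.g. 192.0.2.1 -> 1.2.0.192.<zone>."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # also validates the address
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist answers for this IP.

    Any lookup failure, including the list being unreachable, is treated
    as "not listed" so that a dead blacklist cannot stop mail on its own.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False


print(dnsbl_query_name("192.0.2.1", "dnsbl.example.org"))
# 1.2.0.192.dnsbl.example.org
```

Note that `is_listed` needs a working resolver and a real blacklist zone to return anything useful, so it is not called here.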
The problem with RBLs is that they change all the time. Of course this is the primary reason why people use them, but if you opt to block based on RBLs you are asking for trouble.
One of the most common problems I have seen with users of RBLs stems from ignorance or naivety on the part of admins. If you are going to implement an RBL you should know what it does and why it does it. If I use an RBL I have a reason for doing so, but that may not suit your particular environment and vice versa. Before you add an RBL into your mix you should take the time to visit the RBL’s homepage and read up a little. Find out what the listing and delisting criteria are. If they seem sane then you may choose to use it. If you have any doubts then don’t use it.
Unfortunately a lot of the spam organisations choose to target RBLs, either via DDoS attacks or through legal action. If an RBL becomes inactive then its database is no longer of any use to you, and it may even damage your ability to filter mail, as was the case when a well-known RBL decided to blacklist all mail.
RFC Ignorant, for example, is not a good choice as they have blacklisted the entire IE ccTLD as well as any domain that does not meet their criteria.
Other RBLs, such as Spamhaus, use quite sane criteria; however, due to ISPs’ inaction large blocks of innocent IPs may be listed.
Simply put: If you choose to score based on RBLs you will see good results. If you choose to block based on them you are shooting yourself in the foot.
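The difference between the two approaches is easy to show. In this sketch the function `decide`, the weight of 2.0 per blacklist hit and the threshold of 5.0 are all assumptions chosen for the example, not values from our configuration.

```python
def decide(content_score, rbl_hits, rbl_score=2.0, required=5.0, block_on_rbl=False):
    """Return "reject", "tag" or "deliver" for one message.

    content_score: score from content checks alone
    rbl_hits:      number of blacklists that list the sending IP
    block_on_rbl:  if True, any single listing rejects the mail outright
    """
    if block_on_rbl and rbl_hits > 0:
        return "reject"
    total = content_score + rbl_score * rbl_hits
    return "tag" if total >= required else "deliver"


# A valid email sent from an innocent IP inside a listed netblock.
print(decide(0.5, rbl_hits=1))                     # deliver
print(decide(0.5, rbl_hits=1, block_on_rbl=True))  # reject: valid mail is lost

# Spam that is borderline on content alone is still caught when scoring.
print(decide(4.0, rbl_hits=1))                     # tag
```

When the listing is only one factor among several, a wrongly listed sender still gets through on the strength of clean content, while borderline spam is pushed over the threshold.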
