Girish Venkatachalam is a UNIX hacker with more than a decade of
networking and crypto programming experience.
His hobbies include yoga, cycling and cooking, and he runs his own
business. Details here:
How relevant is a good antispam solution for you?
Unsolicited bulk e-mail (UBE) or unsolicited commercial e-mail (UCE) is what is commonly known as spam. Spam has become such a common "word" nowadays that we use it in connection with wikis, online mail ID creation sites and so on. CAPTCHA is geared towards keeping out robots and programs that masquerade as humans.
We hackers sometimes turn towards evil ways and most spammer botnets are
created by very intelligent but highly immoral crackers who get paid and
sometimes paid too well for the work they do. This causes relentless
misery for you and me. In fact the problem is so far reaching that
anyone with even the slightest exposure to the Internet knows that p0rn
mails are a problem to deal with.
I really wonder how many people can be so naive as to expect
unreasonable sexual satisfaction or sudden billions in their bank
account. Einstein is said to have remarked that the universe and human
stupidity are both infinite and that he was not sure about the former.
Human stupidity makes people open such spammy mails that cause further
problems for their employers who are eminent baits for phishing attacks
and other social engineering attempts wrought by spammer networks.
People have a tendency to expect miracles. Just look at how much money
astrologers, soothsayers and magic healers make. Evidently spam is big
business. Or else we would not be talking about spam. And fools continue
to fall for it. The world cannot be freed from them. So the only
option left to us is to protect them.
Comparison of various approaches
Preventing unwanted e-mails from getting into your Inbox folder is
clearly not a simple task. The smartest minds of our times have applied
themselves to this problem and people like Vipul Ved Prakash have helped
lesser mortals like us live relatively peacefully. It was for Vipul's
razor that Vipul was awarded the MIT Young Innovator award in 2003.
That should tell you how important the spam problem is.
Paul Graham is famous for his essays on spam and Bayesian filtering and
we find the best implementation of his strategy in the spamassassin spam
filter, written in perl (gulp). What I say here may hurt many of you but I will
still say it. Both Vipul's razor and spamassassin are written in perl
but spamassassin sucks. And it sucks real bad. Perl is not the language
for doing content scanning at wire speed. It can be used for prototyping
software and doing non real time work. Well, Apache modules are written
in perl but that is a different story.
I don't have a problem with spamassassin just because it is written in
perl. I don't like it because it does a very nasty job of spam control.
It is complex, slow and causes false positives, quarantines and what
not. Unfortunately it seems hugely popular. Well well.
There are many approaches taken to save vulnerable people from
clicking on nasty spammer ads. I have never fallen for domains like
paypal.us or bankofamerica.foobar.com or whatever. But there are many
who do. The best thing to do would be to not allow such mails to attract
their attention. How?
By doing it at any cost. 'At any cost' means that even if we lose
legitimate mails, we cannot allow dangerous mails in. This is a measure
of desperation taken by companies that are left with no choice.
And open source solutions like spamassassin make the problem worse by
making people believe that you cannot make omelettes without breaking
eggs. If you want no spam, then you also might lose important mail. And
people don't buy products based on technical merit.
People buy products based on brand name. People buy products depending on
what is the coolest thing in town. People buy what other people buy.
They discuss with friends and like minded people and then decide. They
also want to escape responsibility and consequently they do not wish to
risk their reputation. So even if you give them nectar, they will
continue to use existing poison because they know its taste.
Known devil is better than unknown angel. But there are companies where
decisions are taken by people other than systems administrators and half
baked technicians. And companies exist which don't need to answer
somebody else about what they choose. My customer was one such. He is
the proprietor of the company and he knew a thing or two about open
source. And he somehow felt that I could be trusted.
This level of personal interaction cannot be replaced by the excellence
found in the open source world. It takes time for adoption.
But I firmly believe that ultimately if your product is good you will
win. Nature and the law of karma work inexorably and with astonishing
accuracy. The tough and the capable survive. The rest are left in the
Now let us get back to the technical problems associated with spam
control. Spam comes in various shades and colors. It is impossible to
accurately define what is spam. There is a comment often made that what
is spam for one may be ham for another. This is bullshit. Before Google
came along, AltaVista never thought that web search should be done the
way Google does it. And you know the rest.
People clearly know what spam is.
Even if you are really interested in knowing about products that enhance
your private organs, if you don't ask for a mail, or subscribe to a
newsletter and if you receive it, it is clearly spam. As simple as that.
There is one definition of spam that I like a lot. Spam is bulk mail sent
by botnets (automated programs that send out mail) to millions of
unrelated recipients. They pump traffic at such high rates that most of
the traffic overloads on the Internet are caused by such criminals.
So according to me, spam is nothing but Botnet spew. I
like this term a lot. In other words, spam is what is generated by
computers and sent by computers. If humans send mail, even if a mailing
list is used or even if it is addressed to 1000s of recipients, it is
not spam. It is solicited and coming from a human. You may think of it
as spam but it is impossible to make a machine make a decision or an
algorithm come to a conclusion. Consequently we have to conclude that
spam is not unwanted mail. Spam is bulk mail or commercial mail.
The mathematics of the spam problem
Spam control math is no great shakes. For that matter even google's
search algorithms are easy to understand. It is only the details that
need genius to understand and troubleshoot. The basics of math are
always easily understood by common sense.
I learnt Bayesian probability and statistics as part of my engineering
degree long ago. It is a very simple concept. If you throw a fair die
two times, the probability of getting 6 both times is the product of the
probability of getting 6 the first time and the probability of getting 6
the second time.
This multiplication rule holds because the two throws are physically
independent, and Bayesian spam filters assume the same kind of
independence between the words in a mail.
See? It is not so complicated after all. If you read Paul Graham, you
will find that he has expressed the same idea and how it relates to spam
in many words. He is a great writer no doubt and his idea of applying
this simple mathematical truth to a practical and relevant problem like
spam is highly commendable. But read the next paragraph.
But the assumptions made by Paul Graham are not sound. You can have
spammy content in legitimate mails and vice versa. So no matter how
brilliant or adaptive or well performing your algorithm is, its delicate
clockwork will blow to smithereens when fed with unexpected data.
Perhaps I should tell you how Bayesian theory relates to spam control.
Two spammy words appearing close to one another have a certain
probability of occurrence in spam. And a different probability of
occurrence in legitimate mail. This can be compared to the two
probabilities of a fair die giving a result of 6 both times.
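Here is a minimal sketch of that idea, assuming the words in a mail are independent of one another. The combining rule is the one Paul Graham gives in his essay; the function name `combine` and the per-word probabilities are made up purely for illustration.

```python
from fractions import Fraction
from math import prod

# Two throws of a fair die are independent, so the probabilities multiply.
p_six = Fraction(1, 6)
p_two_sixes = p_six * p_six

def combine(word_probs):
    """Combine per-word spam probabilities, assuming the words are independent."""
    spam = prod(word_probs)                 # all words point to spam
    ham = prod(1 - p for p in word_probs)   # all words point to legitimate mail
    return spam / (spam + ham)

# Hypothetical probabilities that a mail containing each word is spam.
two_spammy_words = combine([0.95, 0.90])
one_spammy_one_innocent = combine([0.95, 0.10])
```

Two spammy words together push the score well above either word alone, while one innocent word pulls it back down. That is exactly why spammy content in a legitimate mail, or innocent padding in a spam mail, upsets the result.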
The CRM114 discriminator uses another simple mathematical concept called
Markov chains to further refine this algorithm. Whereas the Bayesian
approach can only account for characters and words occurring together
in headers and mail bodies, Markov chains have the mathematical
ability to construct databases with even sentences. Evidently this is a
lot of work and you require a very powerful processor, memory and of
course disk space.
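To get a feel for why the database grows so quickly, here is a rough sketch that scores runs of consecutive words instead of single words. This is not CRM114's actual feature extraction, which is more elaborate; the function name `phrase_features` and the window size of 3 are assumptions for illustration only.

```python
def phrase_features(text, max_len=3):
    """Return every run of 1..max_len consecutive words in the text."""
    words = text.lower().split()
    features = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length <= len(words):
                features.append(" ".join(words[start:start + length]))
    return features

# A made-up four-word subject line.
features = phrase_features("cheap pills online now")
single_words = phrase_features("cheap pills online now", max_len=1)
```

Even a four-word line yields more than twice as many features as plain word counting, and each of them needs its own entry in the database.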
And people only talk about spam efficiency. 'We stop 99.9999% of the
spam messages.' Now is this calculated as a percentage of total mail
received (including spam), or is it taken against the spam messages that
you did not get, or is it some other metric coming from an obscure
database? God alone knows.
Moreover they conveniently ignore the false positive problem. Your spam
filter is great, it stops 99.999% of the spam. But you lost an important
mail for an appointment with your boss. How good is your filter?
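The difference between these numbers is easy to see with a little arithmetic. The sketch below uses invented counts and a hypothetical helper called `filter_metrics`; it only shows that a single filter produces very different percentages depending on what you divide by.

```python
def filter_metrics(spam_caught, spam_missed, ham_lost, ham_delivered):
    """Compute three different 'efficiency' figures for the same filter."""
    total_spam = spam_caught + spam_missed
    total_ham = ham_lost + ham_delivered
    total_mail = total_spam + total_ham
    return {
        # Share of spam that was stopped: the number vendors like to quote.
        "catch_rate": spam_caught / total_spam,
        # Share of legitimate mail that was lost: the number they leave out.
        "false_positive_rate": ham_lost / total_ham,
        # Share of all incoming mail that was blocked, rightly or wrongly.
        "blocked_share_of_all_mail": (spam_caught + ham_lost) / total_mail,
    }

# Invented counts for one day of mail.
m = filter_metrics(spam_caught=9990, spam_missed=10, ham_lost=5, ham_delivered=995)
```

With these made-up counts the catch rate looks superb, yet five legitimate mails were lost, and one of them could have been the appointment with your boss.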
Products usually do not mention this. They will provide you with
technical support and spend time with you but what about the lost mail?
To mitigate this, many vendors have a concept called the 'quarantine'.
It is a wholly unnecessary overhead invented for business purposes
alone. It serves no practical purpose. Commercial interest forces people
to show how much work they are doing. And sometimes the burden falls on
your head. You have to maintain the product, babysit it and intervene
manually. You have to release the 'parked mail messages' yourself.
My customer had a harrowing time even after buying my product. I was
present in the server room while his sys admin painfully deleted mails
and released mails from the quarantine for the other domain, for which
he had not purchased my product. He had spent a lot of money buying the
product. He could hardly throw it down the drain, could he?
The other math involved in spam control is the math of the rsync
algorithm. Or the checksum computation involved in cryptographic
signatures. If you know the MD5 or the SHA1 message digest algorithm,
you know what I mean. It is vaguely similar to symmetric encryption as
it also involves multiple rounds of similar operations (mainly EX-OR,
modular addition and bit rotation), but you get a fixed-size value as
the result, with the useful property that it is meant to be infeasible
to find two inputs that give the same output. MD5 gives a fixed value of
128 bits as output and SHA1 gives 160 bits as output. Even the slightest
change in input will cause widely varying output.
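You can check both claims, the fixed output size and the sensitivity to small changes, with Python's standard `hashlib` module. The two messages and the helper `differing_bits` are made up for this sketch.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two made-up messages that differ in a single character.
m1 = b"Meet me at 10 am"
m2 = b"Meet me at 11 am"

md5_1 = hashlib.md5(m1).digest()
md5_2 = hashlib.md5(m2).digest()
sha1_1 = hashlib.sha1(m1).digest()

md5_bits = len(md5_1) * 8     # digest size of MD5 in bits
sha1_bits = len(sha1_1) * 8   # digest size of SHA1 in bits
changed = differing_bits(md5_1, md5_2)
```

The digest length never depends on the message length, and a one-character change typically flips roughly half of the output bits.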
rsync uses such checksums to detect file changes: it splits files into
blocks and computes a rolling checksum over them. And Vipul's razor and
DCC use the same concept with a twist to detect spammers modifying their
messages to send to other innocent bystanders on the Internet.
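A rolling checksum is one that can be updated cheaply as the block window slides forward by one byte, instead of being recomputed from scratch. The sketch below follows the weak checksum described in the rsync technical report; it is not rsync's source code, and the block length and sample data are assumptions. The fuzzy signatures that Vipul's razor and DCC use are their own schemes and are not shown here.

```python
M = 1 << 16  # both running sums are kept modulo 2^16

def weak_checksum(block: bytes):
    """Compute the two running sums (a, b) for one block from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte gets weight n, the last byte gets weight 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b

data = b"buy cheap pills now, buy cheap pills today"
block_len = 8

a, b = weak_checksum(data[:block_len])
rolled = [(a, b)]
for k in range(1, len(data) - block_len + 1):
    a, b = roll(a, b, data[k - 1], data[k + block_len - 1], block_len)
    rolled.append((a, b))
```

Each step costs a couple of additions regardless of the block size, which is what makes it practical to look for a known block at every offset of a message.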
The careful reader will notice that this approach has problems too. I
think that gmail uses this approach. We need a corpus of spam and this
approach necessitates manual intervention. A global database of spammy
content is required to feed the checksum computation engine and that is
what is used to prevent others from getting the spam with modifications.
This approach naturally takes us to the next section. So now we are
striving towards a better understanding of spammer mentality and
motives. We are coming closer to the real world and consequently we can
avoid the idealistic assumptions that Paul Graham's Bayesian approach makes.
The business model of spammers
SpamCheetah and OpenBSD greylisting based products/approaches understand
the psychology of spammers and appreciate the practical side of spam
propagation and spam generation. Spam is not something that comes out of
ether and escapes into the void. It is generated by humans who set
machines into motion. They can masquerade IP addresses, they can
masquerade sender e-mail addresses, they can fake several other things
and generate bounce messages or do backscatter, but there are certain
things that they need to strictly abide by if they want to deliver their
message.
This fact will never change. Spammers want their message to land in your
mailbox. And they want you to open it. It may be in Korean or Chinese,
it may be image spam, it may be something else, but they cannot get
around this basic fact.
The second truth that one has to realize is that spammers do this for
money. They don't do this for fun and they don't do this for attaining
nirvana. They do this because marketing sells. And e-mails cost almost
nothing. They do have to pay service providers, and the websites the
spam mails point to help them out with that.
They also face the same problems that criminals face in every country.
They have to grapple with the legal system. They have to face
opposition, complaints and sometimes even punishment. So what do they
do? They start operating in a clandestine fashion and operate in a
manner that helps them escape detection. Hence bogons, which are
unallocated IP addresses of BGP prefixes that spammers use to pump
traffic. Once the harm is done, they move to some other location and
wreak havoc there. You have seen movies in which the villain has multiple
identities and passports and flies off to other countries.
We don't get sufficient time to react. As I said before, most of the
traffic overloads in Internet routers are due to worms and spam. And if
we knew their source, we could always plug the holes. But this is easier
said than done. But we can protect ourselves with the right medicine.
What is the right medicine?
You can force spammers to pass a test which we keep for both innocent
people and criminals. We know the innocent people will pass and that the
criminals will get caught. And this is the test performed by greylisting
and tarpitting. This is further helped by IP address blacklisting and
e-mail address whitelisting. There are idiots amongst spammers who get
caught and there are databases that track the current BGP netblocks
that are known to send spam. SpamCheetah uses all three approaches, viz.
greylisting, tarpitting and a blacklist of known spammers.
But greylisting has a problem. People don't like it because it delays
the first mail from a domain that has never contacted you before. In
practice this is never a problem but people are people after all and
their anxiety needs to be addressed. And people come with baggage.
Greylisting is an old concept and until now, nobody implemented this
approach correctly. Design is one thing. Implementation yet another.
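The test itself is simple enough to sketch in a few lines. This is only an illustration of the greylisting idea, not the SpamCheetah or OpenBSD implementation; the class name `Greylist`, the 300 second delay and the choice of (IP, sender, recipient) as the key are all assumptions.

```python
class Greylist:
    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a new sender must wait (assumed value)
        self.first_seen = {}        # (ip, sender, recipient) -> time of first attempt
        self.whitelist = set()      # IPs that have already passed the test

    def check(self, ip, sender, recipient, now):
        """Return 'accept' or 'tempfail' for one delivery attempt."""
        if ip in self.whitelist:
            return "accept"
        key = (ip, sender, recipient)
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            # A real mail server queues the mail and retries; it passes the test.
            self.whitelist.add(ip)
            return "accept"
        return "tempfail"

g = Greylist()
first_try = g.check("192.0.2.1", "a@example.org", "b@example.net", now=0)
retry_too_soon = g.check("192.0.2.1", "a@example.org", "b@example.net", now=60)
retry_later = g.check("192.0.2.1", "a@example.org", "b@example.net", now=600)
next_mail = g.check("192.0.2.1", "c@example.org", "b@example.net", now=601)
```

Only the very first mail from an unknown host is delayed. Once the host has retried properly, later mail from it goes straight through, which is why the delay is rarely a problem in practice.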
You need to mix greylisting with some pepper and salt to create the
right medicine. And this recipe is reducing the TCP window of the SMTP
dialogue. This is done by the OpenBSD tarpit. It takes genius to come up
with such an idea but it is a very powerful concept. You not only send
back a temporary error (a 4xx reply code such as 451) to the sender, you
also subject the sender to another acid test.
You reply at the rate of 1 character per second. This can be very
annoying for someone who wants to deliver a million messages but for
legitimate senders of human generated mail this is nothing. Yet another
application of understanding real life and nature better.
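A back-of-the-envelope calculation shows how lopsided this is. The reply text below is a made-up SMTP line, `stutter_seconds` is a hypothetical helper, and the total assumes the connections are handled one after another, which a real botnet would not do. It would run them in parallel, but every stalled connection still ties up one of its sockets.

```python
def stutter_seconds(reply: str, chars_per_second: float = 1.0) -> float:
    """Time needed to send one reply at the given stutter rate."""
    return len(reply) / chars_per_second

# A made-up single-line SMTP reply.
reply = "250 Hello, pleased to meet you\r\n"

per_connection = stutter_seconds(reply)

# Cost of one such reply for a million delivery attempts, taken one by one.
attempts = 1_000_000
total_days = attempts * per_connection / 86_400
```

For a person sending one mail, half a minute of waiting goes unnoticed. For someone pushing a million messages, the same half minute per reply adds up to months of stalled connections.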
The greatest side effect of the science and math of OpenBSD greylisting
is that I can now give you spam control on a USB stick. Not the 16 GB one,
but a 1 GB one. And you can run it on a box with tiny processing power and
memory. After all we don't do mail. We don't need hard disks. We don't
need to talk at high speed because our job is to talk slowly, and we
don't need high processing power as we don't have to do content
scanning. We only need the CPU to run our daemons that track IP
addresses in our database. We don't store too much data either. More
The other great side effect or benefit of OpenBSD greylisting is that
you don't even allow the spammer to deliver the message to you. So I
cannot prove that you would have received spam. I never receive it. I
never allow the spammer to consume my bandwidth. Now how can I prove
that I achieve x% spam catch rate? It is impossible. I save you precious
bandwidth, mailbox storage space, backup costs and free your network for
productive activity. Of course I can show that if you don't run this
filter, you receive spam. That is all.
This is a great example of technical superiority enabling unimaginable
Bayesian probability theory
CRM114 Markovian discriminator
Tagged Message Delivery Agent
OpenSPF - Sender Policy Framework
Paul Graham's essays
Ending Spam, a book by dspam creator Zdziarski
SpamCheetah OpenBSD tarpit in action
Youtube video of Spamcheetah Openbsd tarpit
SpamCheetah technical backgrounder
Distributed Checksum Clearinghouse
Understanding the network level behavior of spammers
Spam approaches comparison table
© 2012-07-01 Girish Venkatachalam