Girish Venkatachalam is a UNIX hacker with more than a decade of
networking and crypto programming experience.
His hobbies include yoga, cycling and cooking, and he runs his own
business. Details here:
Unsolicited bulk e-mail (UBE) or unsolicited commercial e-mail (UCE) is what is commonly known as spam. Spam nowadays has become such a common "word" that we use it with wikis, with online mail ID creation sites and so on. CAPTCHA is geared towards stopping robots and programs that masquerade as humans.
We hackers sometimes turn towards evil ways and most spammer botnets are created by very intelligent but highly immoral crackers who get paid, and sometimes paid too well, for the work they do. This causes relentless misery for you and me. In fact the problem is so far reaching that anyone with even the slightest exposure to the Internet knows that p0rn mails are a problem to deal with.
I really wonder how so many people can be so naive as to expect unreasonable sexual satisfaction or sudden billions in their bank account. Einstein is supposed to have said that the universe and human stupidity are both infinite and that he was not sure about the former.
Human stupidity makes people open such spammy mails, and that causes further problems for their employers, who are prime targets for phishing attacks and other social engineering attempts wrought by spammer networks.
People have a tendency to expect miracles. Just look at how much money astrologers, soothsayers and magic healers make. Evidently spam is big business. Or else we would not be talking about spam. And fools continue to fall for it. The world cannot be freed from them. So the only option left to us is to protect them.
Preventing unwanted e-mails from getting into your Inbox folder is clearly not a simple task. The smartest minds of our times have applied themselves to this problem and people like Paul Graham and Vipul Ved Prakash have helped lesser mortals like us live relatively peacefully. It was for Vipul's razor that Vipul was awarded the MIT Young Innovator award in 2003. That should tell you how important the spam problem is.
Paul Graham is famous for his essays on spam and Bayesian filtering and we find the best implementation of his strategy in the spamassassin spam filter written in perl (gulp). What I say here may hurt many of you but I will still say it. Both Vipul's razor and spamassassin are written in perl but spamassassin sucks. And it sucks real bad. Perl is not the language for doing content scanning at wire speed. It can be used for prototyping software and doing non real time work. Well, Apache modules are written in perl but that is a different story.
I don't have a problem with spamassassin just because it is written in perl. I don't like it because it does a very nasty job of spam control. It is complex, slow and causes false positives, quarantines and what not. Unfortunately it seems hugely popular. Well well.
There are many approaches taken to save vulnerable people from clicking on nasty spammer ads. I have never fallen for domains like paypal.us or bankofamerica.foobar.com or whatever. But there are many who do. The best thing to do would be to not allow such mails to attract their attention. How?
By doing it at any cost. By any cost is meant that even if you lose legitimate mails, we cannot allow dangerous mails in. This is a measure of desperation taken by companies that are left with no choice.
And open source solutions like spamassassin make the problem worse by making people believe that you cannot make omelettes without breaking eggs. If you want no spam, then you also might lose important mail. And people don't buy products based on technical merit.
People buy products based on brand name. People buy products depending on what is the coolest thing in town. People buy what other people buy. They discuss with friends and like minded people and then decide. They also want to escape responsibility and consequently they do not wish to risk their reputation. So even if you give them nectar, they will continue to use existing poison because they know its taste.
Known devil is better than unknown angel. But there are companies where decisions are taken by people other than systems administrators and half baked technicians. And companies exist which don't need to answer somebody else about what they choose. My customer was one such. He is the proprietor of the company and he knew a thing or two about open source. And he somehow felt that I could be trusted.
This level of personal interaction cannot be replaced by the excellence found in the open source world. It takes time for adoption.
But I firmly believe that ultimately if your product is good you will win. Nature and the law of karma work inexorably and with astonishing accuracy. The tough and the capable survive. The rest are left in the lurch.
Now let us get back to the technical problems associated with spam control. Spam comes in various shades and colors. It is impossible to accurately define what is spam. There is a comment often made that what is spam for one may be ham for another. This is bullshit. Before google came, Altavista never thought that web search should be done the way google does. And you know the rest.
People clearly know what spam is.
Even if you are really interested in knowing about products that enhance your private organs, if you don't ask for a mail, or subscribe to a newsletter and if you receive it, it is clearly spam. As simple as that.
There is one definition of spam that I like a lot. Spam is bulk mail sent by Botnets (automated programs that send out mail) to millions of unrelated recipients. They pump traffic at such high rates that most of the traffic overloads in the Internet are caused by such criminals.
So according to me, spam is nothing but Botnet spew. I like this term a lot. In other words, spam is what is generated by computers and sent by computers. If humans send mail, even if a mailing list is used or even if it is addressed to 1000s of recipients, it is not spam. It is solicited and coming from a human. You may think of it as spam but it is impossible to make a machine make a decision or an algorithm come to a conclusion. Consequently we have to conclude that spam is not unwanted mail. Spam is bulk mail or commercial mail.
Spam control math is no great shakes. For that matter even google's search algorithms are easy to understand. It is only the details that need genius to understand and troubleshoot. The basics of math are always easily understood by common sense.
I learnt Bayesian probability and statistics as part of my engineering degree long ago. It is a very simple concept. If you throw a fair die two times, the probability of getting 6 both times is the product of the probability of getting 6 the first time and the probability of getting 6 the second time, that is 1/6 x 1/6 = 1/36.
This works because the two throws are physically independent of each other, and Bayesian spam filters make the same independence assumption about the words in a mail.
See? It is not so complicated after all. If you read Paul Graham, you will find that he has expressed the same idea and how it relates to spam in many words. He is a great writer no doubt and his idea of applying this simple mathematical truth to a practical and relevant problem like spam is highly commendable. But read the next paragraph.
But the assumptions made by Paul Graham are not sound. You can have spammy content in legitimate mails and vice versa. So no matter how brilliant or adaptive or well performing your algorithm is, its delicate clockwork will blow to smithereens when fed with unexpected data.
Perhaps I should tell you how Bayesian theory relates to spam control. Two spammy words appearing close to one another have a certain probability of occurrence in spam. And a different probability of occurrence in legitimate mail. This can be compared to the two probabilities of a fair die giving a result of 6 both times.
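The idea can be sketched in a few lines of Python. This is only an illustration of the combining rule Paul Graham describes in his essays, under the assumption that words are independent; the word list and the per-word probabilities below are made up, and a real filter would learn them from a corpus and keep them strictly between 0 and 1.

```python
from math import prod

def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities, treating the words as independent.

    Each value must lie strictly between 0 and 1, otherwise the
    denominator can become zero.
    """
    spam_score = prod(word_probs)                  # all words point to spam
    ham_score = prod(1.0 - p for p in word_probs)  # all words point to ham
    return spam_score / (spam_score + ham_score)

# Hypothetical probabilities that a mail containing the word is spam.
word_probs = {"pills": 0.99, "free": 0.90, "meeting": 0.05}
print(combined_spam_probability(list(word_probs.values())))
```

Note how one innocent word like "meeting" is not enough to rescue the mail when two strongly spammy words are present. That is exactly the kind of behaviour that goes wrong when legitimate mail happens to contain spammy content.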
The CRM114 discriminator uses another simple mathematical concept called Markov chains to further refine this algorithm. Whereas Bayesian probability can only account for characters and words occurring together in headers and mail bodies, Markov chains have the mathematical ability to construct databases with even sentences. Evidently this is a lot of work and you require a very powerful processor, plenty of memory and of course disk space.
And people only talk about spam efficiency. "We stop 99.9999% of the spam messages." Now is this calculated as a percentage of total mail received (including spam) or is it taken against the spam messages that you did not get or is it some other metric coming from an obscure database? God alone knows.
Moreover they conveniently ignore the false positives problem. Your spam filter is great, it stops 99.999% of the spam. But you lost an important mail for an appointment with your boss. How good is your filter? Products usually do not mention this. They will provide you with technical support and spend time with you but what about the lost e-mail?
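To make the two numbers concrete, here is a small sketch that computes both the catch rate and the false positive rate from the same set of counts. The function name and all the counts are hypothetical; the point is only that a vendor can quote the first number while staying silent about the second.

```python
def filter_metrics(spam_caught, spam_missed, ham_blocked, ham_delivered):
    """Return (catch_rate, false_positive_rate) for a spam filter.

    catch_rate is measured against all spam that arrived;
    false_positive_rate is measured against all legitimate mail.
    """
    catch_rate = spam_caught / (spam_caught + spam_missed)
    false_positive_rate = ham_blocked / (ham_blocked + ham_delivered)
    return catch_rate, false_positive_rate

# Made-up counts: an impressive catch rate that still loses legitimate mail.
catch, fp = filter_metrics(spam_caught=99_999, spam_missed=1,
                           ham_blocked=50, ham_delivered=9_950)
print(f"catch rate: {catch:.5%}, false positives: {fp:.2%}")
```

With these invented figures the filter stops all but one spam in a hundred thousand, yet one legitimate mail in two hundred never reaches its reader.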
To mitigate this, many vendors have a concept called the 'quarantine'. It is a wholly unnecessary overhead invented for business purposes alone. It serves no practical purpose. Commercial interest forces people to show how much work they are doing. And sometimes the burden falls on your head. You should maintain the product and babysit it and manually interfere. You should pass 'parked mail messages'.
My customer had a harrowing time even after buying my product. I was present in the server room when his sys admin would painfully delete mails and pass mails in the quarantine for the other domain, for which he had not purchased my product. He had spent a lot of money buying the product. He could not throw it down the drain after all, could he?
The other math involved in spam control is the math of the rsync algorithm. Or the checksum computation involved in cryptographic signatures. If you know the MD5 or the SHA1 message digest algorithm, you know what I mean. It is vaguely similar to symmetric encryption as it also involves multiple rounds of similar operations (mainly bitwise operations such as EX-OR and rotation, plus modular addition), but you get a fixed-length value as result, with the property that it is computationally very hard to find two inputs that give the same output (although MD5 and SHA1 are no longer considered collision resistant). MD5 gives a fixed 128 bits of output and SHA1 gives 160 bits. Even the slightest change in input will cause widely varying output.
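A quick way to see both properties is Python's standard hashlib module. The two messages below are invented; the helper names fingerprint and differing_bits are mine and only serve the illustration.

```python
import hashlib

def fingerprint(name, data):
    """Return the raw digest of data using the named hashlib algorithm."""
    return hashlib.new(name, data).digest()

def differing_bits(x, y):
    """Count the bit positions in which two equal-length digests differ."""
    return sum(bin(p ^ q).count("1") for p, q in zip(x, y))

original = b"Buy cheap pills now"
modified = b"Buy cheap pills now!"   # one extra character

for name in ("md5", "sha1"):
    a = fingerprint(name, original)
    b = fingerprint(name, modified)
    print(name, "output bits:", len(a) * 8,
          "bits changed:", differing_bits(a, b))
```

The output length stays the same whatever the input size, and the one-character change flips a large share of the output bits. This is also why a plain cryptographic digest alone cannot catch a spammer who changes a single character per message.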
rsync splits files into blocks and uses a cheap rolling checksum, backed by a stronger hash of each block, to detect which parts of a file have changed. And Vipul's razor and DCC use the same concept with a twist to detect spammers modifying their messages to send to other innocent bystanders on the Internet.
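Here is a minimal sketch of a rolling checksum in the style of rsync's weak checksum. It is not the code of rsync, Vipul's razor or DCC (their fuzzy checksums are more involved); it only shows why sliding the window by one byte is cheap. The modulus and function names are assumptions made for the example.

```python
M = 1 << 16  # modulus for both running sums (an assumption for this sketch)

def weak_checksum(block):
    """Compute the two running sums (a, b) for a whole block of bytes."""
    n = len(block)
    a = sum(block) % M
    # The first byte gets weight n, the last byte gets weight 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide a window of length n by one byte without rescanning it."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

data = b"Dear friend, you have won a lottery. Dear friend, claim it now."
n = 16
a, b = weak_checksum(data[:n])
for k in range(len(data) - n):
    a, b = roll(a, b, data[k], data[k + n], n)
    assert (a, b) == weak_checksum(data[k + 1:k + 1 + n])
print("rolling checksum matched the direct computation at every offset")
```

Because each step costs a constant amount of work, a scanner can test every offset of a message against a database of known block checksums, and only compute the expensive strong hash when the cheap one matches.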
The careful reader will notice that this approach has problems too. I think that gmail uses this approach. We need a corpus of spam and this approach necessitates manual intervention. A global database of spammy content is required to feed the checksum computation engine and that is what is used to prevent others from getting the spam with modifications.
This approach naturally takes us to the next section. So now we are striving towards a better understanding of spammer mentality and motives. We are coming closer to the real world and consequently we can avoid the idealistic assumptions that Paul Graham's Bayesian approach had.
SpamCheetah and OpenBSD greylisting based products/approaches understand the psychology of spammers and appreciate the practical side of spam propagation and spam generation. Spam is not something that comes out of ether and escapes into the void. It is generated by humans that set machines into motion, and they can masquerade IP addresses, they can masquerade sender e-mail addresses, they can fake several other things and generate bounce messages or do backscatter, but there are certain things that they need to strictly abide by if they want to deliver their spam.
This fact will never change. Spammers want their message to land in your mailbox. And they want you to open it. It may be in Korean or Chinese, it may be image spam, it may be something else, but they cannot get around this basic fact.
The second truth that one has to realize is that spammers do this for money. They don't do this for fun and they don't do this for attaining nirvana. They do this because marketing sells. And e-mails cost nothing. They have to pay service providers, and the websites that the spam mails point to help them with that.
They also face the same problems that criminals face in every country. They have to grapple with the legal system. They have to face opposition, complaints and sometimes even punishment. So what do they do? They start operating in a clandestine fashion and operate in a manner that helps them escape detection. Hence Bogons, which are unallocated IP address blocks (BGP prefixes) that spammers use to pump traffic. Once the harm is done, they go to some other location and wreak havoc. You have seen movies in which the villain has multiple identities and passports and flies off to other countries.
We don't get sufficient time to react. As I said before, most of the traffic overloads in Internet routers are due to worms and spam. And if we knew their source, we could always plug the holes. But this is easier said than done. But we can protect ourselves with the right medicine.
What is the right medicine?
You can force spammers to pass a test which we keep for both innocent people and criminals. We know the innocent people will pass and that the criminals will get caught. And this is the test performed by greylisting and tarpitting. This is further helped by IP address blacklisting and e-mail address whitelisting. There are idiots amongst spammers who get caught and there are databases that track the current BGP netblocks that are known to send spam. SpamCheetah uses all 3 approaches, viz. greylisting, tarpit and blacklist of known spammers.
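The test can be sketched as follows. This is an illustration of the general greylisting idea, not the code of SpamCheetah or of OpenBSD's spamd; the class name, the timing constants and the blacklisted netblock (a documentation-only address range) are all assumptions chosen for the example.

```python
import ipaddress

PASS_TIME = 25 * 60      # seconds a new sender must wait before a retry counts (illustrative)
GREY_EXPIRY = 4 * 3600   # forget senders that never retried within this time (illustrative)

# Hypothetical blacklist; 192.0.2.0/24 is reserved for documentation.
BLACKLIST = [ipaddress.ip_network("192.0.2.0/24")]

class Greylister:
    def __init__(self):
        self.first_seen = {}    # (ip, sender, recipient) -> time of first attempt
        self.whitelist = set()  # IPs that have passed the test

    def check(self, ip, sender, recipient, now):
        """Return 'tarpit', 'tempfail' or 'accept' for one delivery attempt."""
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in BLACKLIST):
            return "tarpit"                 # known spammer: talk very slowly
        if ip in self.whitelist:
            return "accept"                 # passed earlier, no more delay
        key = (ip, sender, recipient)
        first = self.first_seen.get(key)
        if first is None or now - first > GREY_EXPIRY:
            self.first_seen[key] = now      # first contact: ask to retry later
            return "tempfail"
        if now - first >= PASS_TIME:
            self.whitelist.add(ip)          # a real mail server retried patiently
            del self.first_seen[key]
            return "accept"
        return "tempfail"                   # retried too soon

g = Greylister()
print(g.check("198.51.100.7", "a@example.org", "b@example.com", now=0))
print(g.check("198.51.100.7", "a@example.org", "b@example.com", now=30 * 60))
```

A legitimate mail server queues the message and retries, so it passes on the second attempt. Botnet software that fires once and moves on never comes back, and its message is never accepted.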
But greylisting has a problem. People don't like it because it delays the first mail from a domain that has never contacted you before. In practice this is never a problem but people are people after all and their anxiety needs to be addressed. And people come with baggage. Greylisting is an old concept and until now, nobody implemented this approach correctly. Design is one thing. Implementation yet another.
You need to mix greylisting with some pepper and salt to create the right medicine. And this recipe is reducing the TCP window of the SMTP dialogue. This is done by the OpenBSD tarpit. It takes genius to come up with such an idea but it is a very powerful concept. You not only send back a temporary error (a 4xx reply code such as 450 or 451) to the sender, you also subject the sender to another acid test.
You reply at the rate of 1 character per second. This can be very annoying for someone who wants to deliver a million messages but for legitimate human generated mail this is nothing. Yet another application of understanding real life and nature better.
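The stuttering itself is trivial to express. The sketch below is not how the OpenBSD tarpit is implemented; the function names and the reply text are assumptions, and the write and sleep callables are passed in so the idea can be tried without a real socket.

```python
import time

def stutter(reply, write, delay=1.0, sleep=time.sleep):
    """Send reply one character at a time, pausing after each character."""
    for ch in reply:
        write(ch)
        sleep(delay)

def seconds_to_send(reply, delay=1.0):
    """Time the remote end is kept waiting for this single reply."""
    return len(reply) * delay

# Illustrative reply text; a real server chooses its own wording.
reply = "451 Temporary failure, please try again later.\r\n"
print(seconds_to_send(reply), "seconds for one reply line")

# Demonstration with a fake sleep so that the example finishes instantly.
sent = []
stutter(reply, sent.append, sleep=lambda s: None)
assert "".join(sent) == reply
```

A single reply line of this size ties up the sender for most of a minute. That costs a legitimate server nothing worth mentioning, but it ruins the arithmetic of anyone trying to push a million messages through.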
The greatest side effect of the science and math of OpenBSD greylisting is that I can now give spam control on a USB stick. Not the 16 Gig one, but a 1 GB one. And you can run it in a box with tiny processing power and memory. After all we don't do mail. We don't need hard disks. We don't need to talk at high speed because our job is to talk slowly, and we don't need high processing power as we don't have to do content scanning. We only need the CPU to run our daemons that track IP addresses in our database. We don't store too much data either. More details here.
The other great side effect or benefit of OpenBSD greylisting is that you don't even allow the spammer to deliver the message to you. So I cannot prove that you would have received spam. I never receive it. I never allow the spammer to consume my bandwidth. Now how can I prove that I achieve x% spam catch rate? It is impossible. I save you precious bandwidth, mailbox storage space, backup costs and free your network for productive activity. Of course I can show that if you don't run this filter, you receive spam. That is all.
This is a great example of technical superiority enabling unimaginable possibilities.
© 2012-07-01 Girish Venkatachalam