APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

Detecting Comment Spam, Part 2

This is a continuation of Detecting Comment Spam, Part 1

In part 1, I talked about code to read a list of spammish words from a file and look for those words in comment posts. Commenters pointed out that spammers will obfuscate words with dashes, spaces, bizarre spellings and so on, making it very difficult to catch these programmatically. That's true, but there's more to the story.

The spam list I use here has some of those common spam words, but most of it is taken up with web addresses. Links are much more difficult for spammers to mangle - they can use redirection at the destination site, but the site itself is static: if a spammer wants you to visit some page at Iamaspammer.com, either that name or its IP must be in a link. Most of my spam list consists of websites that have been the destination of spammers' links. Once a site is in my list, the spammer can never post anything with that link in it - no matter how they munge other words, they will never be allowed to post. A comment here that contains one of those links doesn't even go to moderation - it just gets flatly rejected.
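A minimal sketch of that link check, not the code that actually runs on this site. The helper names (link_hosts, has_blocked_link) are made up for illustration, and matching subdomains of a blocked name is an assumption rather than something the article specifies:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the host part of every http/https link.
sub link_hosts {
    my ($text) = @_;
    my %hosts;
    while ($text =~ m{https?://([^/\s:"'<>]+)}ig) {
        $hosts{ lc $1 } = 1;
    }
    return sort keys %hosts;
}

# True if a linked host is on the list, or is a subdomain of a listed name.
# $blocked is a hash ref keyed by lower-case host name (or IP address).
sub has_blocked_link {
    my ($text, $blocked) = @_;
    for my $host (link_hosts($text)) {
        my @parts = split /\./, $host;
        while (@parts) {
            return 1 if $blocked->{ join '.', @parts };
            shift @parts;    # try the parent domain next
        }
    }
    return 0;
}

# In a real script the list would be read from the spam file.
my %blocked = map { $_ => 1 } qw(iamaspammer.com jklljas.blogspot.com);

print has_blocked_link('Cheap pills at http://www.Iamaspammer.com/buy', \%blocked)
    ? "reject\n"
    : "continue checking\n";
```

Because only the host name is compared, the words around the link can be mangled any way the spammer likes without changing the result.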

Spammers do move on - jklljas.blogspot.com may be a spam link now, but it will get abandoned eventually. I trim the list every month to remove old entries.

IP Blocking

Shouldn't I just block the spammer's IP from leaving comments? Yes, but spammers' IPs change over time - their IP gets blocked everywhere, so they move on. If you block a particular IP forever, it may end up being the IP of a legitimate user, so you probably don't want to do that. There is also the issue that your list of banned IPs could get very large over time.

For some websites, it makes sense to block by country of origin. I don't do that here, but if you only want U.S. visitors, you could certainly do that. See Blocking Unwanted Visitors.

You can do the blocking inside your script or use .htaccess (or your Apache configuration files). I prefer to block inside my comment posting script because if I'm wrong or if the IP has transferred to a non-spammer, I'm only blocking them from commenting, not from any site access. In extreme cases (such as a spammer attempting to use any CGI script it can find or guess) I will add them to the .htaccess file. However, whether in scripts or .htaccess, I don't keep the IP blocked for more than a week at a time.

For me, the need to block by IP is infrequent enough that I don't need to automate the removal of bans, but it isn't difficult to write such code if needed.
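If you did want bans to expire on their own, one way is to store the time each ban was set and drop the entry once it is a week old. This is only a sketch: the in-memory hash stands in for whatever file or database a real script would use, and is_banned is a hypothetical name.

```perl
use strict;
use warnings;

my $BAN_SECONDS = 7 * 24 * 60 * 60;    # one week

# %$bans maps IP address => epoch time the ban was set.
# An expired entry is deleted as a side effect of looking it up.
sub is_banned {
    my ($bans, $ip, $now) = @_;
    my $since = $bans->{$ip};
    return 0 unless defined $since;
    if ($now - $since >= $BAN_SECONDS) {
        delete $bans->{$ip};
        return 0;
    }
    return 1;
}

my $now  = time;
my %bans = (
    '192.0.2.10' => $now - 3600,              # banned an hour ago
    '192.0.2.20' => $now - 8 * 24 * 3600,     # banned eight days ago
);

for my $ip (sort keys %bans) {
    printf "%s: %s\n", $ip,
        is_banned(\%bans, $ip, $now) ? 'still banned' : 'ban expired';
}
```

Expiring on lookup means no separate cleanup job is needed, at the cost of stale entries lingering until that IP shows up again.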

Moderation

Let's talk about moderation for a moment. Some sites moderate all comments, but that's annoying for both the moderator and the people leaving comments. Regular posters shouldn't have to wait for their comments to appear and most web site owners have better things to do than moderate comments all day long.

One solution is to require registration. If a user can provide a login and password that has been previously approved, their comment can be posted immediately. Many sites use that scheme, but the registration process is an extra step that annoys some people.

I do something similar here, but there's no registration process per se. If you have posted here legitimately in the past and are posting again from the same IP address and with the same username (actual username, not "anonymous") your post will not need to be moderated - assuming it passes all the other tests I'll talk about below!
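The name-plus-IP test could be sketched like this. It is an illustration only: the function names are invented, the hash stands in for a real store of previously approved posters, and treating names as case-insensitive is my assumption.

```perl
use strict;
use warnings;

# Hypothetical key format: lower-cased, trimmed name plus IP address.
sub poster_key {
    my ($name, $ip) = @_;
    $name = '' unless defined $name;
    $name =~ s/^\s+|\s+$//g;
    return lc($name) . '|' . $ip;
}

# A poster skips moderation only with a real name and a matching IP.
sub is_trusted {
    my ($approved, $name, $ip) = @_;
    my $key = poster_key($name, $ip);
    return 0 if $key =~ /^(?:anonymous)?\|/;    # empty name or "anonymous"
    return $approved->{$key} ? 1 : 0;
}

# Pairs recorded when earlier comments were approved.
my %approved = ( poster_key('TonyLawrence', '192.0.2.7') => 1 );

for my $try (
    [ 'tonylawrence', '192.0.2.7' ],
    [ 'TonyLawrence', '198.51.100.9' ],
    [ 'anonymous',    '192.0.2.7' ],
) {
    my ($name, $ip) = @$try;
    printf "%-13s %-13s %s\n", $name, $ip,
        is_trusted(\%approved, $name, $ip) ? 'post now' : 'moderate';
}
```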

But as alluded to when talking about destination links above, there's no need to moderate posts that are definitely spam - we just throw those away.

Other spam control

I'm going to talk frankly about all the things I do here to limit spam and what I plan to do in my next version of the commenting software. I suppose there is some small risk to this; I've been reluctant to discuss all of it before because I don't want to help spammers learn better ways to bypass systems like this, but I don't think spammers are likely to read this and if they do, oh well: the war goes on.

Lazy Spammers

Spammers aren't generally going to spend a lot of time and effort on posts. The pros use scripts, and even the inept are probably automated enough to use cut and paste. Even if they are typing comments in, they probably tend to reuse the same words and links.

Not much text

Some habits of lazy spammers are fairly easy to block. One is a habit of only leaving links with little or no text. That's easy to detect: just count the total words in the post and divide by the number of http links: if the result is too small, the post is probably spam. I do that here now - if you leave a comment that is only a link, it won't get posted and it won't even go to moderation - whether the link is legitimate or not. That might be too draconian, but I think you should at least flag such posts for moderation. I do need to fix my present code to allow known user/IP posters to leave such short comments.
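A sketch of the words-per-link test. The article doesn't give the actual cutoff, so the threshold of 5 below is a made-up value, and counting whitespace-separated tokens (so each link counts as one word) is likewise an assumption:

```perl
use strict;
use warnings;

my $MIN_WORDS_PER_LINK = 5;    # made-up threshold; tune for your site

# Words divided by links, or undef when the post has no links at all.
sub words_per_link {
    my ($text) = @_;
    my $links = () = $text =~ m{https?://}ig;
    return undef unless $links;
    my $words = () = $text =~ /\S+/g;    # each link counts as one word
    return $words / $links;
}

sub too_linky {
    my ($text) = @_;
    my $ratio = words_per_link($text);
    return ( defined($ratio) && $ratio < $MIN_WORDS_PER_LINK ) ? 1 : 0;
}

for my $post (
    'http://example.com/',
    'I wrote up the same problem last year at http://example.com/notes if anyone wants the details.',
) {
    printf "%-8s %s\n", too_linky($post) ? 'suspect' : 'ok', $post;
}
```

A post with no links never trips this test; it is left to the other checks.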

Nonsense words

When lazy spammers do pad their word count, they often use nonsense words. Again, that's relatively easy to thwart if you have the luxury of time and disk space: look up each word in a dictionary and compare the words that aren't found to the words that are - if that ratio is too high, this is probably spam. But that requires a fair amount of work. I take a simpler approach:

If $in is the text to be checked, the Perl expression

$consonants++ while $in =~ /[qwrtpsdfghjklzxcvbnm]{4,}/ig;

counts the number of times four or more consonants appear in a row. That is, it counts garbage like "fghk" or "hdfr" - the kind of junk you'll get from random banging on the keyboard. If that count is high when compared to the number of words in the input, you probably have spam. My code says if the ratio of nonsense to words is over 15% it's spam. This is much faster than checking input against a dictionary and is very effective.

Remember that "jklljas.blogspot.com"? If all he posts is something like "Xanax: http://jklljas.blogspot.com", he's got a 50% nonsense count ("http" and "jklljas" against 4 words) - that's enough to count as spam right there. Not all spammers use nonsense words in their sites, but quite a few do, and many are too lazy to type much more than that.
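Wrapped into a function, the consonant test might look like the sketch below. The article doesn't show how words are counted, so treating every run of letters as a word is an assumption; counted that way, the Xanax example comes out as 2 hits against 5 words (40%) rather than the 4 words used above, which is still well over the 15% limit.

```perl
use strict;
use warnings;

my $MAX_NONSENSE = 0.15;    # the 15% limit mentioned above

# Ratio of four-or-more-consonant runs to words in the text.
sub nonsense_ratio {
    my ($in) = @_;
    my $consonants = 0;
    $consonants++ while $in =~ /[qwrtpsdfghjklzxcvbnm]{4,}/ig;
    my $words = () = $in =~ /[A-Za-z]+/g;    # assumption: a word is a run of letters
    return 0 unless $words;
    return $consonants / $words;
}

sub looks_like_nonsense {
    my ($in) = @_;
    return nonsense_ratio($in) > $MAX_NONSENSE ? 1 : 0;
}

for my $post (
    'Xanax: http://jklljas.blogspot.com',
    'This is a perfectly ordinary comment about Unix.',
) {
    printf "%.2f %s\n", nonsense_ratio($post), $post;
}
```

Note that "http" itself counts as a hit, so every link adds to the score; legitimate posts survive because they have enough ordinary words around the link.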

This won't catch all nonsense: we've all seen random phrases from books used as a preface to a link. Nonetheless, this stops SOME spam, and every post we stop is one that doesn't annoy us or our readers.

Direct POSTS

Another thing spammers do is send direct POSTS. That is, their automated software examines your comment form once, picks out the fields it needs to supply, and then submits multiple POST requests. I don't allow that: when the comment form is first loaded, the commenter's IP address is stored in a database. When they actually POST, the IP is checked and immediately removed. If the IP doesn't exist (which it will not after the first POST), the comment is thrown away. The spammer can defeat this by requesting the form before doing each POST, but many of these folks are too lazy to bother.
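The load-then-POST bookkeeping can be sketched with a hash standing in for the database. The function names are hypothetical, and a real CGI script would need persistent storage because each request runs as a separate process:

```perl
use strict;
use warnings;

# Called when the comment form is served: remember who loaded it and when.
sub note_form_load {
    my ($pending, $ip, $now) = @_;
    $pending->{$ip} = $now;
    return;
}

# Called on POST: returns the form load time and forgets the IP,
# or returns undef if this IP has no outstanding form load.
sub claim_post {
    my ($pending, $ip) = @_;
    return delete $pending->{$ip};
}

my %pending;
my $ip = '192.0.2.44';

note_form_load(\%pending, $ip, time);
print defined claim_post(\%pending, $ip) ? "first POST accepted\n"  : "first POST rejected\n";
print defined claim_post(\%pending, $ip) ? "second POST accepted\n" : "second POST rejected\n";
```

Returning the stored load time rather than a plain true value makes it available for timing checks on the same request.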

Typed too fast

After reading a link suggested in the comments, I realized that there ought to be a minimum thinking/typing time also. So along with the IP, I store the time the form was loaded. When the POST is made, I count the words and divide by 10 - if at least that many seconds haven't elapsed, I won't allow the post. Only a cut and paste spammer can type more than 10 words per second, so that's a fair limit - maybe even less is fair, but a legitimate poster might cut and paste some text.
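A sketch of the minimum-time rule, assuming words are whitespace-separated tokens and that the form load time is available as an epoch value (both assumptions; the function names are invented):

```perl
use strict;
use warnings;

my $WORDS_PER_SECOND = 10;    # the typing speed limit described above

# Fewest seconds a post of this length should have taken to write.
sub min_seconds {
    my ($text) = @_;
    my $words = () = $text =~ /\S+/g;
    return $words / $WORDS_PER_SECOND;
}

# True if the POST arrived sooner after the form load than the text allows.
sub typed_too_fast {
    my ($text, $loaded_at, $posted_at) = @_;
    return ( $posted_at - $loaded_at ) < min_seconds($text) ? 1 : 0;
}

my $comment = join ' ', ('word') x 50;    # a 50-word comment needs 5 seconds

for my $elapsed (2, 5, 60) {
    printf "%2d seconds: %s\n", $elapsed,
        typed_too_fast($comment, 1000, 1000 + $elapsed) ? 'too fast' : 'ok';
}
```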

Excessive posting

Some spammers are greedy. They aren't content with putting one piece of graffiti on your site; they want to leave many spam posts. I use a timing algorithm to control that. It's simple enough: for each post from your IP, a counter is incremented. I use that counter to determine how long before you are allowed to post again. You can't make your second post until 15 seconds after your first, your third until 120 seconds after that, your fourth until 405 seconds and so on - it's the number of posts cubed times 15 seconds. This very effectively stops greedy spammers - they don't hang around. In the current version, these limits affect everyone (even me!) but I want to make them less restrictive (maybe posts cubed times 5 seconds) for known users in the next version.
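The delay formula is easy to check with a few lines of Perl. This is a sketch with invented function names; the optional second argument is there only to show the gentler 5-second multiplier floated above for known users.

```perl
use strict;
use warnings;

# Seconds to wait after the Nth post: N cubed times the base (15 by default).
sub wait_seconds {
    my ($posts, $base) = @_;
    $base = 15 unless defined $base;
    return $posts**3 * $base;
}

# True if enough time has passed since this IP's most recent post.
sub may_post {
    my ($posts_so_far, $last_post_time, $now, $base) = @_;
    return 1 unless $posts_so_far;
    return ( $now - $last_post_time ) >= wait_seconds($posts_so_far, $base) ? 1 : 0;
}

for my $n (1 .. 4) {
    printf "after post %d: wait %d seconds (known user: %d)\n",
        $n, wait_seconds($n), wait_seconds($n, 5);
}
```

This prints waits of 15, 120, 405 and 960 seconds for the default multiplier, matching the figures above, and 5, 40, 135 and 320 seconds for the reduced one.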

The war is never over

So that's how I control spam comments here. Most obvious spam goes right to a black hole, users I know and trust get posted immediately, and everything else goes to moderation. I'm going to add an Akismet check in the next version - it never hurts to have a second opinion!
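Put together, the routing described above comes down to a three-way decision. In this sketch the two flags are assumed to have been computed by earlier checks (spam tests and the known user/IP lookup), and route_comment is a hypothetical name:

```perl
use strict;
use warnings;

# Decide what happens to a comment once the individual checks have run.
sub route_comment {
    my (%check) = @_;
    return 'discard' if $check{is_spam};       # obvious spam: black hole
    return 'publish' if $check{is_trusted};    # known name and IP: post now
    return 'moderate';                         # everything else waits
}

print route_comment(is_spam => 1, is_trusted => 1), "\n";
print route_comment(is_spam => 0, is_trusted => 1), "\n";
print route_comment(is_spam => 0, is_trusted => 0), "\n";
```

The spam test comes first on purpose: a trusted name and IP does not excuse a post that fails the other checks.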

Instead of moderation, you could also throw up CAPTCHA or arithmetic challenges, or require a random series of other answers or actions when a post is suspect. A spammer (especially an automated spammer) probably won't respond to even a simple "Click here to confirm your post" - I may add something like that to my next version. That's a minor annoyance for a first time poster or someone who insists upon using "anonymous", but it will stop most spammers dead.

Of course it is impossible to stop all spam. No matter what we do, we'll at least have to moderate a post now and then. However, with the spam controls I have here, I very seldom see spam - and if I do see it, I'm unlikely to see it again, because I'll adjust my code as necessary. I do have to moderate new visitors and frequent posters with new IP addresses, but that's not very onerous.

The war on spam never ends, but we can win most of the battles.

See Detecting Comment Spam, Part 3 for the continuation of this series.




© Anthony Lawrence







Wed Dec 23 20:37:01 2009: 7811   MikeHostetler



I think you have a good plan here -- it's one thing for the spammers to know what you are doing to stop them, it's another for them to find a hole in your system. One thing about a good plan is that it has to be adaptable -- you may lose a battle here and there, but if you learn from your mistakes and tweak here and there, you will win many more than you will lose.

It would be interesting if you had a "minority report" bucket for comments -- ones that your system and Akismet have different opinions on. Maybe you can review them manually for a while -- perhaps you will learn a few things. Or maybe you will find out that you have a better system than they do!

Another interesting idea is to see if you can hook up SpamAssassin to help as well. I can't imagine the rules for email spam being tons different from blog spam. Or maybe they are. Some notes have been written on it, but I think that's about it:
(link)



Wed Dec 23 20:41:44 2009: 7812   TonyLawrence



You'll be amused by this, Mike: Akismet says your comment is spam :-)

Obviously I disagree. I'll take a look at that SpamAssassin link, thanks.



Wed Dec 23 20:51:37 2009: 7813   TonyLawrence



A comment at that SpamAssassin link gave me another idea:

When the form is loaded, I store the time against the user's IP. When they hit POST, it's allowed if the time lapse is less than 2 hours (that's a pretty generous time limit to compose your post). The time is reset to zero upon posting, so a new POST without a form load won't work.

The SpamAssassin link suggested that a minimum time makes sense too - a real user presumably spends at least a few seconds composing a message. That makes sense - I'll be adding something to do that.




Wed Dec 23 20:57:49 2009: 7814   TonyLawrence



Yeah, that makes sense: count the words, divide by 10 (pretty fast typing). If it hasn't been that many seconds between loading and posting, disallow.



Wed Dec 23 21:49:21 2009: 7815   MikeHostetler



That's hilarious that they marked me as spam. Probably because I used a link. See -- evidence that you need a multi-prong attack!



Thu Dec 24 03:11:59 2009: 7819   TonyLawrence



I found the problem. As I mentioned before, the old server sent cgi href's as GET's - this one always uses POST. My comment code was still expecting a GET on the first load - which led me to check Akismet twice, the first time without any data.

Fixed now (I hope).



Thu Dec 24 13:27:41 2009: 7821   TonyLawrence



Just as a general point of interest:

Last night this code caught 4 comments, three of which were pure spam and thrown away and one which I had to moderate (and subsequently toss because it was spam). Akismet didn't think that one was spam, and I don't send the pure junk to them for checking.

I don't track attempted comments thrown away because of posting too quickly but I can see in the logs that spammers do get caught by that.






Thu Dec 24 14:01:58 2009: 7822   TonyLawrence



Akismet just caught one. My consonant filter also caught it and it was tagged again for links to words ratio, so there wasn't much hope for that one to get through :-)



Fri Dec 25 02:58:22 2009: 7824   Michiel



Interesting, I'm going to follow these articles. It gives me some good ideas. In the meantime I have some stuff done, and reCAPTCHA is next. See www.michielovertoom.com/simpleblog/log.html (link dead, sorry) for a short story on my proceedings.



Tue Dec 29 02:54:38 2009: 7831   Michiel



Yesterday I implemented reCAPTCHA in my prototype blog software. It was easier than expected: all you have to do is to sign up and receive some keys in the mail, then include a little bit of code in your script. Basically two functions do all the work, and they are provided in a ready-to-include script file.

I followed the suggestion made earlier to remember both the username and IP address as a combination, and only require captcha validation once.

I also updated the webpage in which I describe the steps I took. See michielovertoom.com/simpleblog/log.html (link dead, sorry) .

I'm looking forward to implementing some more antispam ideas!






Fri Feb 5 14:00:44 2010: 8032   TonyLawrence



I tried Akismet for a month or so. I found that it seldom helped me - that is, if it said something was spam, my code usually already knew that was true and my code caught spam that it didn't know.

There were a few instances where it saw spam that I didn't. These were so few that I didn't feel it was worth paying the monthly fee (I have far too much traffic to qualify for the free version).

It's a good tool - if you can't write your own code, I would definitely recommend it.
