APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

2005/03/09 Comment Spam

Web site owners like comment systems. Aside from letting visitors tell us their opinions of our probable IQ, genealogy, and sexual habits, comments also provide a way to correct errors and add new information.

Unfortunately, spammers like to use comment systems for their own purposes, adding links to adult or other sites that don't quite fit the theme of our sites. People who use content management systems to produce their site are apt to get a lot of attention from spammers just because the interface to their comment systems is public knowledge and it's easy for spammers to write automated form submissions. The content management producers have begun to fight back, applying various spam-fighting techniques. Of course the spammers will try to learn how those work and thwart them, and on it goes.

What about those of us who write our own code? What can we do? Actually, quite a bit. For example, here's a little snippet of Perl that looks at comments here:

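 my $spam  = 1;   # start at 1 so the division below can't divide by zero
 my $words = 0;
 if (length($in) > 5) {
   $words++ while $in =~ /\w /g;     # rough word count: a word character followed by a space
   $spam++  while $in =~ /http:/g;   # add one for every link
   my $toomuch = $words / $spam;     # roughly, words per link
   $spam--;                          # back to the real link count
   $in=" (looks like spam)" if $toomuch < $SPAM1 and $words > $SPAM1;
   $in=" (looks like spam - too many links) " if $spam > $SPAM2;
   $in=" (looks like spam ?) " if (($words - $spam) < $SPAM3 and $words > $SPAM3);
 }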
 

That code counts words and the number of times "http:" appears among them. It then makes some judgements based on the relative values ($SPAM1, $SPAM2 and $SPAM3 are set earlier in the script to values I think are reasonable). The math could be different depending on what type of comments you typically get, but this is the basic idea.

You can also look for certain words. If your site gets hit by a lot of spammers, the necessary words aren't too hard to figure out.
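
A word check might look something like the sketch below. The @BAD_WORDS list and the looks_spammy() helper are made up for illustration (they are not from the script running here), and the words shown are only placeholders for whatever your own spammers favor.

```perl
use strict;
use warnings;

# Hypothetical word list; build yours from the spam your own site receives.
my @BAD_WORDS = ('casino', 'viagra', 'poker');

# One case-insensitive pattern that matches any listed word as a whole word.
my $pattern = join '|', map { quotemeta($_) } @BAD_WORDS;
my $bad_re  = qr/\b(?:$pattern)\b/i;

# Returns the number of listed words found in the comment text.
sub looks_spammy {
    my ($text) = @_;
    my $hits = () = $text =~ /$bad_re/g;
    return $hits;
}

my $in = "Play online POKER at our casino today";
$in = " (looks like spam - banned words) " if looks_spammy($in) > 0;
print "$in\n";
```

Matching on whole words keeps an innocent "pokerface" from tripping the filter, at the cost of missing deliberately mangled spellings.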

Additionally, I implement a time-based requirement. If you make a post or a comment, you have to wait a certain number of seconds before posting again. The required time goes up with the cube of the number of posts made in the last 24 hours, so while your third post requires 8 units of time, the next is 27, then 64 and so on. This cuts down on multiple submissions that might otherwise get by the spam filtering.
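
Here is a rough sketch of that rule, not the actual code used here. It assumes a made-up $UNIT of 10 seconds and assumes the caller already has the timestamps of the visitor's earlier posts; how you identify a visitor and where you store those times is up to you.

```perl
use strict;
use warnings;

# Hypothetical length of one "unit" of time, in seconds.
my $UNIT = 10;

# Seconds a visitor must wait after their latest post, given how many
# posts they have already made in the last 24 hours.
sub required_wait {
    my ($recent_posts) = @_;
    return $UNIT * $recent_posts ** 3;
}

# True if a new post is allowed now. @times holds epoch timestamps of the
# visitor's earlier posts (assumed to be stored somewhere by the caller).
sub may_post {
    my ($now, @times) = @_;
    my @recent = grep { $now - $_ < 24 * 60 * 60 } @times;
    return 1 unless @recent;
    my ($latest) = sort { $b <=> $a } @recent;
    return ($now - $latest) >= required_wait(scalar @recent);
}

printf "post %d needs %d units\n", $_ + 1, required_wait($_) / $UNIT for 2 .. 4;
```

With two posts already made, the third needs 2 cubed, or 8 units; the fourth needs 27 and the fifth 64, matching the numbers above.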

Finally, I change the interface every now and then. An automated submission will set form values based upon its knowledge of your form. So, the form might be looking for "authtext" today, but next week it will be "inputtext". You can do this manually or programmatically.
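
One way to do it programmatically is to derive the field name from the date. This is only a sketch under my own assumptions: the $SECRET value, the weekly schedule, and the field_name() and comment_text() helpers are all invented for illustration. Digest::MD5 ships with Perl.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical secret; keep the real one out of the page source.
my $SECRET = 'change-me';

# Field name for the week containing $time (epoch seconds). The name
# changes every 7 days and is hard to guess without the secret.
sub field_name {
    my ($time) = @_;
    my $week = int($time / (7 * 24 * 60 * 60));
    return 'f' . substr(md5_hex("$SECRET:$week"), 0, 8);
}

# Returns the submitted comment text, or nothing if the form did not use
# this week's or last week's field name. $params is a hash ref of form fields.
sub comment_text {
    my ($params, $now) = @_;
    for my $t ($now, $now - 7 * 24 * 60 * 60) {
        my $n = field_name($t);
        return $params->{$n} if defined $params->{$n};
    }
    return;    # stale or guessed field name: probably automated
}

# Emit the form field with this week's name.
my $name = field_name(time);
print qq{<textarea name="$name"></textarea>\n};
```

Accepting last week's name as well keeps a visitor who loaded the page just before the changeover from being rejected.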






1 comment




© Tony Lawrence







Thu Mar 10 20:43:31 2005: 161   anonymous


There has been some talk of people writing code to do SURBL.org lookups for blog spam. Also the devs of SpamAssassin have been talking about an offshoot called BlogAssassin to fight this kind of stuff.

Most likely the best way would be to code lookups for SURBL.












