Detecting Comment Spam, Part 1

2009/12/21

Suppose you were writing a commenting system for a website and you wanted to check user input against a list of words that might indicate spam. You'd keep the suspicious words in a file and test the comment against each one in turn. An easy way to do that in Perl is with its built-in "grep" function.

We'll start with a program that won't work. This will show a side effect of grep you need to be aware of:


#!/usr/bin/perl
# Build the "comment" from standard input
while (<>) {
    push @TST, $_;
}
# Now test against our list
open(SPAM, "spamlist") or die "Can't open spamlist: $!";
while (<SPAM>) {
    chomp;
    if (grep /$_/, @TST) {
        print "found $_\n";
    }
}
 

You need this code and a "spamlist" file. You'd put the words you want to match in that file, one per line. For the purposes of this article, I'll assume that "fribble" is NOT in the list. Therefore, if you run this little script, type "fribble", press Enter and then Ctrl-D, you'd expect no response: "fribble" isn't in the list of spam words.

But that's not what happens. When you run the program and type "fribble", it seems that "fribble" (or any other input) matches every word in the list. That can't be right, can it?
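A stripped-down sketch shows the same surprise without needing any files. The comment text and the spam word are made up for illustration, and the hypothetical @comment and $word stand in for @TST and one line of "spamlist":

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data standing in for the typed comment and one spam word.
my @comment = ("fribble\n");
my $word    = "ambien";

# Mimic the loop above: the spam word is sitting in $_ ...
my $aliased_hits;
for ($word) {
    # ... but inside grep, $_ is each element of @comment in turn,
    # so every line ends up being matched against itself.
    $aliased_hits = grep /$_/, @comment;
}

# Using a named variable tests what we meant to test.
my $named_hits = grep /$word/, @comment;

print "with \$_ in the pattern:    $aliased_hits match(es)\n";
print "with \$word in the pattern: $named_hits match(es)\n";
```

The first count is nonzero even though "ambien" appears nowhere in the comment; the second count is zero.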

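The problem is that "grep" sets $_ itself: while it evaluates the pattern, $_ is aliased to each element of @TST in turn, so /$_/ compares every comment line with itself instead of with the word from the spam list. The outer $_ comes back once grep finishes, which is why the print still shows the spam word. That's simple to fix; we copy the spam word into a temporary variable: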

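#!/usr/bin/perl
# Build the "comment" from standard input
while (<>) {
    push @TST, $_;
}
# Now test against our list
open(SPAM, "spamlist") or die "Can't open spamlist: $!";
while (<SPAM>) {
    chomp;
    # Copy the spam word; grep reuses $_ for each element of @TST
    my $testing = $_;
    if (grep /$testing/, @TST) {
        print "found $testing\n";
    }
}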
 

That's a bit better. Running it with "fribble" produces no output, but if you give it something in your list, it finds it. Great!

Not quite. Let's say you had "ambien" in your list because you want to stop common pharmaceutical spam. If you type "ambien" when running the program, yes, it finds it, but it will also find "ambient". That's not good - how do we fix that?

Well, we want "ambien" only when it's a word by itself. Your first thought might be to use a space or "\s":

  if (grep / $testing /, @TST  ) 
  ...
  if (grep /\s$testing\s/, @TST  ) 
 

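But that fails if "ambien" is at the beginning of a line in @TST, because nothing precedes it. The \s version does work when "ambien" is at the end of a line: the lines in @TST still carry their trailing newline, and \s matches a newline as well as spaces, tabs and formfeeds. The literal-space version fails there too.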
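A quick check makes the gap visible. This sketch assumes three made-up comment lines, each still carrying its trailing newline as it would when read with <>:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $testing = "ambien";

# Hypothetical comment lines, newline intact as when read with <>.
my @lines = (
    "ambien is cheap here\n",    # word at the start of the line
    "buy ambien today\n",        # word in the middle
    "ask me about ambien\n",     # word at the end
);

for my $line (@lines) {
    my $verdict = $line =~ /\s$testing\s/ ? "match" : "no match";
    printf "%-8s : %s", $verdict, $line;
}
```

Only the first line is missed, since there is no whitespace character in front of the word.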

OK, we could do this:

 if (grep /\s$testing\s/, @TST  or  grep /\s$testing$/, @TST  or  grep /^$testing\s/, @TST)
 

Fortunately, we don't need to. Perl has a better way:

 if (grep /\b$testing\b/, @TST)
 

That "\b" is for "word boundary" and it does exactly what we want: it matches "ambien" wherever it appears in a line as a whole word. Note that "\b" also works with command line grep (GNU grep, at least), but the "\<word\>" form, as in grep "\<word\>" files, only works with command line grep, not Perl.
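Here is a small sketch of the word-boundary test wrapped in a helper. The helper name has_word and the sample lines are invented for illustration, and the \Q...\E quoting is an addition that the code above does not use; it simply keeps any regex metacharacters in a list entry from being treated as pattern syntax:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: number of lines containing $word as a whole word.
sub has_word {
    my ($word, @lines) = @_;
    # \Q...\E (an addition, not in the article's code) makes the
    # list entry match literally, metacharacters included.
    return scalar grep { /\b\Q$word\E\b/ } @lines;
}

my @ambient = ("I like ambient music\n");
my @spammy  = ("ambien, cheap!\n", "buy ambien\n");

print "ambient line: ", (has_word("ambien", @ambient) ? "found" : "clean"), "\n";
print "spammy lines: ", (has_word("ambien", @spammy)  ? "found" : "clean"), "\n";
```

Note that \b only marks a boundary next to a word character, so an entry that begins or ends with punctuation would need different handling.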

So, finding "spam" words isn't too hard. The next question is what to do about them when you find them. A few thoughts come to mind:

  • Refuse the comment flatly.
  • Refuse the comment but tell the user what word(s) triggered the rejection.
  • Remove the offending words from the comment.
  • Increment a counter for each spam indicator found; refuse if the count exceeds some value (this is roughly how SpamAssassin works).
  • Set the comment to require administrative approval before posting.
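The counter idea in the fourth bullet could be sketched as follows. The spam words, the sample comment and the threshold are all invented for illustration, and this is not SpamAssassin's actual scoring, only the same general shape:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how many list entries appear as whole words.
sub spam_score {
    my ($spamlist, $comment) = @_;    # two array references
    my $score = 0;
    for my $word (@$spamlist) {
        # grep in boolean context: true if any line contains the word
        $score++ if grep { /\b\Q$word\E\b/ } @$comment;
    }
    return $score;
}

my @spamlist  = qw(ambien viagra casino);    # made-up list
my @comment   = ("Cheap ambien and viagra\n", "ambient lighting tips\n");
my $threshold = 1;                           # arbitrary cutoff

my $score = spam_score(\@spamlist, \@comment);
print "score $score: ", ($score > $threshold ? "refuse" : "accept"), "\n";
```

Each list entry counts once no matter how often it appears; weighting entries differently would be a natural next step.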

None of these are ideal. In our next post, I'll dig into that a bit more deeply.





15 comments




© Anthony Lawrence







Mon Dec 21 21:18:25 2009: 7786   MikeHostetler



I've actually used Akismet in the past -- if you are a small user, it's free and it works well. It uses the same ideas that Tony is going to talk about, but it applies them across a much wider audience. More people to count means (ideally) better pattern detection.

(link)



Mon Dec 21 21:31:22 2009: 7787   TonyLawrence



I see a lot of people use Akismet. I still prefer to roll my own - for many reasons.



Mon Dec 21 21:44:43 2009: 7788   TonyLawrence



I do notice that Akismet has a Perl module on CPAN:

(link)

With that, you could "roll your own" while still taking advantage of Akismet.

I need to rewrite my comments code; I might just do that... no harm in having another opinion, after all.



Tue Dec 22 11:33:07 2009: 7790   Michiel



I'm experimenting with anti-spam measures too, and I think this is what I will use: When my blog detects that a contribution is made from an unknown IP address, it'll present a 'ReCAPTCHA' (see (link)) for verification. I'm not sure scanning for spam words will work as well, since spammers always find new ways to hide their message from automatic scanners.



Tue Dec 22 11:45:00 2009: 7791   TonyLawrence



I'm not sure scanning for spam words will work as well, since spammers always find new ways to hide their message from automatic scanners.

I'll be talking about this more in the next part, but "words" aren't just things like pharmaceutical names. One thing spammers can't obfuscate is their link destination - the whole point is to post a link. The destination is part of the spam list.



Tue Dec 22 12:55:33 2009: 7793   Ralph



In recent times there have been blog comments that consist only of legitimate words, so that the spam checker would not find any cause for rejection. But I regard these comments as spam too, because they are totally unrelated to the blog posting and their only purpose is to place a link to a suspect page. These comments tend to be overwhelmingly "positive" (and slimy) but they're nevertheless junk and clutter the blog with nonsensical praise.

I fear we cannot fight spammers with software.



Tue Dec 22 13:08:17 2009: 7794   TonyLawrence



their only purpose is to place a link to a suspect page

Exactly. That's why the page(s) go into my spam list.

we cannot fight spammers with software

We can. But like any war, we'll lose a few battles. We can't stop EVERY spam post with software, but we can stop most of them.






Tue Dec 22 13:43:56 2009: 7795   MikeHostetler



Michiel said, "When my blog detects that a contribution is made from an unknown IP address, it'll present a 'ReCAPTCHA' (See (link) ) for verification"

The problem with ReCAPTCHA is that it annoys legitimate users. But your use of only doing it for unknown IP addresses will help a little bit. But what if a legitimate user and a spammer are both behind the same firewall? Not to mention that sometimes figuring out what the Captcha is can be difficult (though ReCAPTCHA is better than most).

Tony said, "I need to rewrite my comments code; I might just do that... no harm in having another opinion, after all. "

I used Python's Akismet library to do some integration and it was very easy. You're right -- it's never bad to have another opinion.



Tue Dec 22 14:23:39 2009: 7796   TonyLawrence



But what if a legitimate user and a spammer are both behind the same firewall?

Perhaps similar to what I do to decide if the comment can be published immediately or needs moderation. I key on IP plus username (anonymous is always moderated). A spammer could only guess at usernames that would match a legitimate user's IP.

So Michiel could check both the IP and the username before throwing up a captcha - if the user has had that IP before, no captcha.







Tue Dec 22 14:40:41 2009: 7797   TonyLawrence



By the way, I want to wish all who are reading today a happy holiday.

We're pretty busy this week getting ready to have family in over the weekend, so I don't plan on Part 2 of this until next week. In that post, I want to discuss more about some of the ideas raised here in the comments: the ideals and realities of controlling spam comments, possible methods, pros and cons and so on.







Tue Dec 22 15:52:19 2009: 7800   Ralph



Tony replied:

their only purpose is to place a link to a suspect page
Exactly. That's why the page(s) go into my spam list.

But if you want to build such a list in these cases you have to examine the link "by hand" to find out if it is suspect or not; I cannot imagine any software to do that. I've decided to delete an unrelated comment based on the lack of information in it without looking at the link.



Tue Dec 22 16:09:21 2009: 7801   TonyLawrence



But if you want to build such a list in these cases you have to examine the link "by hand" to find out if it is suspect or not

Not entirely. I'll be talking more about that in the next post.

The war on spam is never a matter of using one tool. It's a combination of approaches that will give you good control.



Tue Dec 22 17:24:59 2009: 7802   BigDumbDInosaur



If qualifying words in text against a list of verboten words (e.g., *redacted*) is to be a primary anti-spam tool, I'd be inclined to keep the verboten word list sorted and use a binary search on it. A binary search for a single object in an ordered list executes in, worst-case, O(logN) iterations (where N is the number of words in the list), whereas grep's linear search will always require N iterations to determine that a suspect word is not verboten. This aspect of grep could represent a significant amount of processing time on a busy site, especially with long-winded posts.

Also, it might be useful to develop some sort of mechanism that could automate the adding of bad words to the verboten list. Phrases might be good as well, as oftentimes phrases are spam whereas individual words used in a phrase may be benign.



Tue Dec 22 21:55:32 2009: 7803   TonyLawrence



If you are searching a large list, certainly. Here, it's less than 200 lines and 2K. But more importantly - we're not searching the list, we are searching the post for words in the list. You'd need to invert the search (take each word in the post and search in the list). I doubt this would be faster for the typical data sets, but it would be interesting to try some tests.
