APLawrence - Information and Resources for Unix and Linux Systems, Bloggers and the self-employed















Detecting Comment Spam, Part 1


Suppose you were writing a commenting system for a website and wanted to check user input against a list of words that might indicate spam. You'd keep the list of suspicious words in a file and test the comment against each word in turn. An easy way to do that in Perl is with the built-in "grep" function.

We'll start with a program that won't work. This will show a side effect of grep you need to be aware of:


#!/usr/bin/perl
# Build the "comment"
while (<>) {
    push @TST, $_;
}
# Now test against our list
open(SPAM, "spamlist") or die "Cannot open spamlist: $!";
while (<SPAM>) {
    chomp;
    if (grep /$_/, @TST) {
        print "found $_\n";
    }
}
 

You need this code and a "spamlist" file containing the words you want to match, one per line. For the purposes of this article, I'll assume that "fribble" is NOT in the list. So if you run this little script, type "fribble", press Enter and then Ctrl-D, you'd expect no response: "fribble" isn't in the list of spam words.

But that's not what happens. When you run the program and type "fribble" (or any other input), it reports a match for every word in the list. That can't be right, can it?

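The problem is that "grep" sets $_ to each element of @TST in turn while it evaluates the pattern. Inside the grep, /$_/ no longer refers to the word read from the spam list; it tests each comment line against itself, which almost always matches. Once grep returns, $_ is the spam word again, so the print reports every word in the list as found. That's simple to fix; we save the word in a temporary variable: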

#!/usr/bin/perl
# Build the "comment"
while (<>) {
    push @TST, $_;
}
# Now test against our list
open(SPAM, "spamlist") or die "Cannot open spamlist: $!";
while (<SPAM>) {
    chomp;
    my $testing = $_;    # keep the spam word; grep reuses $_ internally
    if (grep /$testing/, @TST) {
        print "found $testing\n";
    }
}
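
As an aside, the behavior that broke the first program can be seen in isolation. The following is only a minimal sketch: the sub name count_self_matches and the sample lines are made up for illustration and are not part of the article's program. It shows that a pattern built from $_ inside grep is built from each list element, and that the outer $_ is intact once grep returns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Counts the lines that "match" when the pattern is built from $_
# inside grep, which is the mistake in the first program.
# (count_self_matches is a hypothetical name used only for this demo.)
sub count_self_matches {
    my @lines = @_;
    # Here $_ is each element of @lines, not a word from any spam list.
    my $count = grep { /$_/ } @lines;
    return $count;
}

my @comment = ("fribble\n", "hello world\n");

for ("ambien") {    # out here, $_ plays the role of the spam word
    my $hits = count_self_matches(@comment);
    print "lines matched: $hits\n";    # every line matches itself
    print "\$_ after grep: $_\n";      # the outer $_ is unchanged
}
```

A line containing regex metacharacters (a "+" or an unbalanced parenthesis, say) may fail to match itself or even raise a pattern error, which is one more reason not to build the pattern from the comment text. With the temporary variable in place, the corrected program above avoids all of this.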
 

That's a bit better. Running it with "fribble" produces no output, but if you give it something in your list, it finds it. Great!

Not quite. Let's say you had "ambien" in your list because you want to stop common pharmaceutical spam. If you type "ambien" when running the program, yes, it finds it, but it also reports "ambien" if you type "ambient". That's not good. How do we fix that?

Well, we want "ambien" only when it's a word by itself. Your first thought might be to use a space or "\s":












  if (grep / $testing /, @TST  ) 
  ...
  if (grep /\s$testing\s/, @TST  ) 
 

But that fails if "ambien" is at the beginning of a line in @TST, because nothing precedes it. It works if "ambien" is at the end because the lines in @TST still carry their trailing newline, and \s matches a newline in addition to real spaces, tabs and formfeeds.
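
A quick way to see this is to run the \s version against three sample lines. This is a rough sketch; the sub name found_with_spaces and the sample text are assumptions made for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns the number of lines containing $word with whitespace on both
# sides. (found_with_spaces is a hypothetical helper for this demo.)
sub found_with_spaces {
    my ($word, @lines) = @_;
    my $count = grep { /\s$word\s/ } @lines;
    return $count;
}

my @tests = (
    "buy ambien now\n",     # middle of the line
    "buy ambien\n",         # end of the line: the newline counts as \s
    "ambien for sale\n",    # start of the line: nothing precedes the word
);

for my $line (@tests) {
    my $result = found_with_spaces('ambien', $line) ? 'match' : 'no match';
    chomp(my $shown = $line);
    printf "%-18s %s\n", $shown, $result;
}
```

Only the third line fails to match. Note that a final line typed without a trailing newline would also fail at the end of the line, since there would be no whitespace after the word.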

OK, we could do this:

 if (grep /\s$testing\s/, @TST  or  grep /\s$testing$/, @TST  or  grep /^$testing\s/, @TST)
 

Fortunately, we don't need to. Perl has a better way:

 if (grep /\b$testing\b/, @TST)
 

That "\b" is for "word boundary" and it does exactly what we want: it matches "ambien" wherever it appears in a line as a whole word, whether it sits next to a space, punctuation, or the start or end of the line. Note that "\b" also works with GNU grep on the command line, but "grep "\<word\>" files" only works with command line grep, not Perl.
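
Here is a small sketch of the word-boundary test wrapped in a sub. Two things in it are additions that the article does not use and should be read as assumptions: \Q...\E, which keeps any regex metacharacters in a list entry from being treated as pattern syntax, and the /i flag, which ignores case. The sub name spam_words_found and the sample data are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns the spam words that appear as whole words in the comment.
# $words and $lines are array references.
sub spam_words_found {
    my ($words, $lines) = @_;
    my @found;
    for my $word (@$words) {
        next if $word eq '';    # ignore blank entries in the list
        my $count = grep { /\b\Q$word\E\b/i } @$lines;
        push @found, $word if $count;
    }
    return @found;
}

my @list    = ('ambien', 'viagra');
my @comment = (
    "The ambient light was nice.\n",
    "Cheap Ambien, no prescription!\n",
);

print "found $_\n" for spam_words_found(\@list, \@comment);
```

The first line does not trigger a match because "ambient" continues past "ambien" with another word character; the second does, because the comma ends the word. One caveat: \b only behaves this way when the list entry begins and ends with a word character, so entries such as ".com" would need different anchoring.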

So, finding "spam" words isn't too hard. The next question is what to do about them when you find them. A few thoughts come to mind:

  • Refuse the comment flatly.
  • Refuse the comment but tell the user what word(s) triggered the rejection.
  • Remove the offending words from the comment.
  • Increment a counter for each spam indicator found; refuse if the count exceeds some value (this is broadly how SpamAssassin works, with a weighted score per indicator).
  • Set the comment to require administrative approval before posting.

None of these are ideal. In our next post, I'll dig into that a bit more deeply.
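
To make the counting option concrete, here is a minimal sketch that combines it with the moderation option. Everything specific in it is an assumption for illustration: the sub names spam_hits and decide, the thresholds of 1 and 3, and the one-point-per-word scoring (SpamAssassin weights its rules rather than counting them equally).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns the list entries found as whole words in the comment lines.
sub spam_hits {
    my ($words, $lines) = @_;
    my @hits;
    for my $word (@$words) {
        my $count = grep { /\b\Q$word\E\b/i } @$lines;
        push @hits, $word if $count;
    }
    return @hits;
}

# Maps a score to an action. The thresholds are hypothetical policy
# values passed in by the caller.
sub decide {
    my ($score, $moderate_at, $refuse_at) = @_;
    return 'refuse'   if $score >= $refuse_at;
    return 'moderate' if $score >= $moderate_at;
    return 'accept';
}

my @list    = ('ambien', 'viagra', 'casino');
my @comment = ("Cheap ambien and viagra here\n", "Visit today\n");

my @hits   = spam_hits(\@list, \@comment);
my $action = decide(scalar @hits, 1, 3);
print "$action: @hits\n";
```

Because spam_hits returns the matching words rather than just a count, the same result could be used to tell the user which words triggered the rejection.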



15 comments




More Articles by Anthony Lawrence - Find me on Google+








Mon Dec 21 21:18:25 2009:   MikeHostetler
http://squarepegsystems.com
I've actually used Akismet in the past -- if you are a small user, it's free and it works well. It uses the same ideas that Tony is going to talk about, but it applies them across a much wider audience. More people to count means (ideally) better pattern detection.

http://akismet.com/



Mon Dec 21 21:31:22 2009:   TonyLawrence

I see a lot of people use Akismet. I still prefer to roll my own - for many reasons.



Mon Dec 21 21:44:43 2009:   TonyLawrence

I do notice that Akismet has a Perl module on CPAN:

http://search.cpan.org/~nikolay/Net-Akismet/lib/Net/Akismet.pm

With that, you could "roll your own" while still taking advantage of Akismet.

I need to rewrite my comments code; I might just do that... no harm in having another opinion, after all.



Tue Dec 22 11:33:07 2009:   Michiel

I'm experimenting with anti-spam measures too, and I think this is what I will use: When my blog detects that a contribution is made from an unknown IP address, it'll present a 'ReCAPTCHA' (See http://recaptcha.net/ ) for verification. I'm not sure scanning for spam words will work as well, since spammers always find new ways to hide their message from automatic scanners.



Tue Dec 22 11:45:00 2009:   TonyLawrence

"I'm not sure scanning for spam words will work as well, since spammers always find new ways to hide their message from automatic scanners."

I'll be talking about this more in the next part, but "words" aren't just things like pharmaceutical names. One thing spammers can't obfuscate is their link destination - the whole point is to post a link. The destination is part of the spam list.



Tue Dec 22 12:55:33 2009:   Ralph
http://linuxcoaching.eu
In recent times there have been blog comments that consist only of legitimate words, so that the spam checker would not find any cause for rejection. But I regard these comments as spam too, because they are totally unrelated to the blog posting and their only purpose is to place a link to a suspect page. These comments tend to be overwhelmingly "positive" (and slimy) but they're nevertheless junk and clutter the blog with nonsensical praise.

I fear we cannot fight spammers with software.



Tue Dec 22 13:08:17 2009:   TonyLawrence

"their only purpose is to place a link to a suspect page"

Exactly. That's why the page(s) go into my spam list.

"we cannot fight spammers with software"

We can. But like any war, we'll lose a few battles. We can't stop EVERY spam post with software, but we can stop most of them.




Tue Dec 22 13:43:56 2009:   MikeHostetler
http://squarepegsystems.com
Michiel said, "When my blog detects that a contribution is made from an unknown IP address, it'll present a 'ReCAPTCHA' (See http://recaptcha.net/ ) for verification"

The problem with ReCAPTCHA is that it annoys legitimate users. Only showing it for unknown IP addresses will help a little bit. But what if a legitimate user and a spammer are both behind the same firewall? Not to mention that sometimes figuring out what the Captcha says can be difficult (though ReCAPTCHA is better than most).

Tony said, "I need to rewrite my comments code; I might just do that... no harm in having another opinion, after all. "

I used Python's Akismet library to do some integration and it was very easy. You're right -- it's never bad to have another opinion.



Tue Dec 22 14:23:39 2009:   TonyLawrence

"But what if a legitimate user and a spammer are both behind the same firewall?"

Perhaps something similar to what I do to decide whether a comment can be published immediately or needs moderation. I key on IP plus username (anonymous is always moderated). A spammer could only guess at usernames that would match a legitimate user's IP.

So Michiel could check both the IP and the username before throwing up a captcha - if the user has had that IP before, no captcha.
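
The IP-plus-username idea described in this comment might look something like the sketch below. It is not the site's actual code: the in-memory %known_pairs hash, the sub names, and the sample addresses (taken from the 192.0.2.0/24 documentation range) are all assumptions for illustration, and a real site would store the pairs in a file or database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical store of "username|ip" pairs that have posted before.
my %known_pairs;

sub remember_poster {
    my ($user, $ip) = @_;
    $known_pairs{"$user|$ip"} = 1;
    return;
}

# True when the poster should see a captcha or be held for moderation.
sub needs_check {
    my ($user, $ip) = @_;
    # Anonymous posters are always checked.
    return 1 if !defined $user || $user eq '' || lc($user) eq 'anonymous';
    return $known_pairs{"$user|$ip"} ? 0 : 1;
}

remember_poster('michiel', '192.0.2.10');

for my $case (['michiel', '192.0.2.10'],
              ['michiel', '192.0.2.99'],
              ['anonymous', '192.0.2.10']) {
    my ($user, $ip) = @$case;
    printf "%-10s %-12s %s\n", $user, $ip,
        needs_check($user, $ip) ? 'check' : 'ok';
}
```

Only the first case passes without a check: the second uses a known name from a new address, and the third is anonymous.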





Tue Dec 22 14:40:41 2009:   TonyLawrence

By the way, I want to wish all who are reading today a happy holiday.

We're pretty busy this week getting ready to have family in over the weekend, so I don't plan on Part 2 of this until next week. In that post, I want to discuss more about some of the ideas raised here in the comments: the ideals and realities of controlling spam comments, possible methods, pros and cons and so on.





Tue Dec 22 15:52:19 2009:   Ralph
http://linuxcoaching.eu
Tony replied:

"their only purpose is to place a link to a suspect page"
"Exactly. That's why the page(s) go into my spam list."

But if you want to build such a list in these cases, you have to examine the link "by hand" to find out whether it is suspect or not; I cannot imagine any software that could do that. I've decided to delete unrelated comments based on the lack of information in them, without looking at the link.



Tue Dec 22 16:09:21 2009:   TonyLawrence

"But if you want to build such a list in these cases you have to examine the link 'by hand' to find out if it is suspect or not"

Not entirely. I'll be talking more about that in the next post.

The war on spam is never a matter of using one tool. It's a combination of approaches that will give you good control.



Tue Dec 22 17:24:59 2009:   BigDumbDInosaur
http://bcstechnology.net
If qualifying words in text against a list of verboten words (e.g., Viagra) is to be a primary anti-spam tool, I'd be inclined to keep the verboten word list sorted and use a binary search on it. A binary search for a single object in an ordered list executes in, worst-case, O(logN) iterations (where N is the number of words in the list), whereas grep's linear search will always require N iterations to determine that a suspect word is not verboten. This aspect of grep could represent a significant amount of processing time on a busy site, especially with long-winded posts.

Also, it might be useful to develop some sort of mechanism that could automate the adding of bad words to the verboten list. Phrases might be good as well, as oftentimes phrases are spam whereas individual words used in a phrase may be benign.



Tue Dec 22 21:55:32 2009:   TonyLawrence

If you are searching a large list, certainly. Here, it's less than 200 lines and 2K. But more importantly, we're not searching the list; we are searching the post for words in the list. You'd need to invert the search (take each word in the post and search for it in the list). I doubt this would be faster for the typical data sets, but it would be interesting to try some tests.
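
For anyone who wants to try the inverted search, here is one possible sketch. It swaps the binary search suggested above for a Perl hash, which gives a direct lookup per word without keeping the list sorted. The sub names build_lookup and spam_words_in are made up for the example, and the approach only handles single-word entries: phrases and link destinations in the list would still need a pattern match against the whole post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the lookup table once; keys are lower-cased spam words.
sub build_lookup {
    my %lookup;
    $lookup{ lc $_ } = 1 for @_;
    return \%lookup;
}

# Split each line of the post into words and look each one up.
# Returns the distinct spam words found, sorted.
sub spam_words_in {
    my ($lookup, @lines) = @_;
    my %seen;
    for my $line (@lines) {
        for my $word (split /\W+/, lc $line) {
            $seen{$word} = 1 if $lookup->{$word};
        }
    }
    my @found = sort keys %seen;
    return @found;
}

my $lookup = build_lookup('Ambien', 'viagra');
my @post   = ("Cheap AMBIEN here\n", "The ambient light was nice\n");

print "found $_\n" for spam_words_in($lookup, @post);
```

Splitting on \W+ gives whole-word matching for free, so "ambient" does not trigger "ambien" here either. Whether this beats the grep loop on a list of a couple of hundred entries is exactly the kind of thing that would need timing.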


This post tagged:

       - Blogging
       - Perl
       - Web/HTML



