APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

Another Perl link checker

© July 2013 Anthony Lawrence


Checking links is the most unpleasant part of running a large website. While there are hundreds (if not thousands) of link checkers available on the web, none of them really meet my needs.

I explained a bit about why at Smarter HTML Link Extractor and I used the code presented there for some time even though it wasn't what I wanted either.

Recently I started using this code, which, while not perfect, meets my needs more closely than anything else I have used. I'll explain why after we look at the code:

use LWP::Simple;
use File::Basename;
require HTML::LinkExtor;

$checkfile = shift @ARGV;
open(LINKS, ">/root/linkcheck.out");

$cookie_file = "/root/data/cookies/cookies.txt";
if (not $checkfile) {
    print "Getting paster\n";
    $checkfile = `cat /root/data/paster`;
    chomp $checkfile;
}
$p = HTML::LinkExtor->new();
$p->parse_file($checkfile);
@links = $p->links;

foreach (@links) {
    ($type, $link) = ($$_[0], $$_[2]);
    # only want anchor links
    next if $type ne "a";
    next if $link =~ /mailto:/;
    chomp $link;
    print "Checking link $type $link\n" if $debug;
    print LINKS "Checking link $type $link\n" if $debug;
    if ($link !~ /http:/ and $link !~ /https:/ and $link !~ /ftp:/) {
        print "Fixing relative\n" if $debug;
        print LINKS "Fixing relative\n" if $debug;
        $link =~ s/^/http:\/\/aplawrence.com/;
    }
    $blink = $link;
    $blink =~ s/http:..//;
    # No need to check links to interior pages
    next if -e "/srv/www/aplawrence.com/apl/$blink";

    $deep = 0;
    $r = check($link);
    myprint("$link may be dead\n") if not $r;
    print "Relocated\n" if (not $r and $deep);
    if (not $r and $resp) {
        print "but did get a response\n";
    }
}

sub check {
    my $url = shift;
    # If this is the first try, let's just see if we get a page
    if (not $deep) {
        $resp = 0;
        myprint(qq(curl -c $cookie_file -m 1 -r 0-1024 -A "Mozilla/4.0" -s $url 2>&1\n)) if $debug;
        @stuff = `curl -c $cookie_file -m 1 -r 0-1024 -A "Mozilla/4.0" -s $url 2>&1`;
        foreach (@stuff) {
            next if not $_;
            $resp = 1 if (/<html/i or /DOCTYPE/);
        }
    }
    # Get headers
    @stuff = `curl -G -b $cookie_file -m 1 -A Mozilla/4.0 -s --head $url 2>&1`;
    if ($stuff[0] =~ /curl.* Couldn't resolve/) {
        # No such site
        return 0;
    }
    if ($stuff[0] =~ / 200 OK/) {
        myprint("$url\n") if $debug;
        if ($deep) {
            myprint("\n\nPossibly replace $link with \n $url \n");
        }
        return 1;
    }
    if ($stuff[0] =~ / 301 /) {
        myprint("$url: $stuff[0]\n") if $debug;
        # Moved ..
        foreach (@stuff) {
            if (/Location: / and not $deep) {
                s/Location: //;
                print("Recheck $_\n") if $debug;
                $deep = 1;
                return check($_);
            }
        }
    }
    # Some other return, needs manual checking
    return 0;
}

sub myprint {
    my $string = shift;
    print $string;
    print LINKS $string;
    foreach (@stuff) {
        print LINKS $_;
    }
}

I think that's about as close as I'm going to get, for various reasons.

First off, you'll notice I used curl. Actually, I originally wrote this using LWP, but it was easier to debug problems using curl at the command line, so I went that way instead. Now that I've figured out what I need to do, I could rewrite it for LWP, but really, why bother? The speed issue here is the internet, not the cost of firing off "curl".
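For what it's worth, here's a rough sketch of what that LWP rewrite might look like. The names `status_of` and `lwp_check` are my own inventions for this sketch, and the logic simply mirrors what the curl version greps out of the headers; I haven't run this against the site.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Map a response to the same outcomes the curl version greps for.
sub status_of {
    my $resp = shift;                      # an HTTP::Response object
    return "ok"    if $resp->code == 200;  # curl: / 200 OK/
    return "moved" if $resp->code == 301;  # curl: / 301 /
    return "dead";                         # anything else: check manually
}

my $ua = LWP::UserAgent->new(agent => "Mozilla/4.0", timeout => 1);

sub lwp_check {
    my $url    = shift;
    my $resp   = $ua->head($url);          # like curl --head
    my $status = status_of($resp);
    if ($status eq "moved") {
        my $loc = $resp->header("Location");
        print "Recheck $loc\n" if $loc;
    }
    return $status eq "ok" ? 1 : 0;
}
```

One nicety of doing it this way is that LWP hands back a real response object, so there's no pattern matching against curl's text output.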

The other issue is that it needs to be slightly broken.

Yeah, that sounds funny, but consider this: sometimes a website will accept a link that is long gone and relocate you somewhere pointless - or at least pointless from my point of view. While the relocation might make sense for an old link in some cases, it may not in others, so I need to look at all those cases individually.

There are links that will fail for other reasons: they need a POST instead of a GET, for example. Again, I want to look at those manually before I accept them or let them pass.

Finally, I don't correct relative links of the form "xyz.html". I don't want to leave those like that anyway as they will fail should I someday move the page to some other directory, so I let them fail the check even though they are usually valid.
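If I ever did want to resolve those bare relative links instead of letting them fail, something like this would do it. This is just a sketch: `absolutize` and its `$base` argument are hypothetical - the script above doesn't track the URL of the page it's checking, which it would need for this to work.

```perl
use strict;
use warnings;

# Resolve a link against the URL of the page it appeared on.
sub absolutize {
    my ($link, $base) = @_;
    return $link if $link =~ m{^(?:https?|ftp):};           # already absolute
    return "http://aplawrence.com$link" if $link =~ m{^/};  # site-root relative
    $base =~ s{/[^/]*$}{/};    # keep only the directory part of the page's URL
    return $base . $link;      # "xyz.html" resolves to the same directory
}

print absolutize("xyz.html", "http://aplawrence.com/apl/foo.html"), "\n";
# prints http://aplawrence.com/apl/xyz.html
```

Of course, that would hide exactly the fragility I want flagged, which is why the real script doesn't do it.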

I thought about adding some code that will make a record of "accepted" links so that I need not check them again the next time I run this, but as things at the website may have changed by then, that doesn't make sense. I may need to repeat my work next year, but that's the way it is.

As there is so much to check, I don't even try to run through the whole site. I run this when I have some other reason to look at a page: somebody left a comment, or it came up when I was searching for something. Other than that, broken links could sit here for many a year!

Got something to add? Send me email.
