Another Perl link checker


2013/07/11

Checking links is the most unpleasant part of running a large website. While there are hundreds (if not thousands) of link checkers available on the web, none of them really meet my needs.

I explained a bit about why at Smarter HTML Link Extractor, and I used the code presented there for some time even though it wasn't what I wanted either.

Recently I started using this code, which, while not perfect, meets my needs more closely than anything else I have used. I'll explain why after we look at the code:



#!/usr/bin/perl
use LWP::Simple;
use File::Basename;
require HTML::LinkExtor;

# Page to check comes from the command line, or from the paster file if none given
$checkfile = shift @ARGV;
open(LINKS, ">/root/linkcheck.out") or die "Can't write /root/linkcheck.out: $!";

$cookie_file = "/root/data/cookies/cookies.txt";
if (not $checkfile) {
  print "Getting paster\n";
  $checkfile = `cat /root/data/paster`;
  $debug = 1;
}
chomp $checkfile;

# Pull every link out of the local copy of the page
$p = HTML::LinkExtor->new();
$p->parse_file("/root/data/text$checkfile");
@links = $p->links;

foreach (@links) {
  $type = $_->[0];
  # only want anchor links
  next if $type ne "a";
  $link = $_->[2];
  next if $link =~ /mailto:/;
  chomp $link;
  print "Checking link $type $link\n" if $debug;
  print LINKS "Checking link $type $link\n" if $debug;
  # Turn site-relative links into full URLs
  if ($link !~ /http:/ and $link !~ /https:/ and $link !~ /ftp:/) {
    print "Fixing relative\n" if $debug;
    print LINKS "Fixing relative\n" if $debug;
    $link =~ s/^/http:\/\/aplawrence.com/;
  }
  $blink = $link;
  $blink =~ s/^https?:\/\///;
  # No need to check links to interior pages
  next if -e "/srv/www/aplawrence.com/apl/$blink";
  $deep  = 0;
  $moved = 0;
  $resp  = 0;
  $r = check($link);

  myprint("$link may be dead\n") if not $r;
  print "Relocated\n" if (not $r and $deep);
  if (not $r and $resp) {
    print "but did get a response\n";
  }
}


sub check {
  my $url = shift;
  $resp = 0;
  @stuff = ();
  # If this is the first try, let's just see if we get anything that looks like a page
  if (not $deep) {
    myprint(qq(curl -c $cookie_file -m 1 -r 0-1024 -A "Mozilla/4.0" -s $url 2>&1)) if $debug;
    @stuff = `curl -c $cookie_file -m 1 -r 0-1024 -A "Mozilla/4.0" -s $url 2>&1`;
    foreach (@stuff) {
      s/^\s+//;
      next if not $_;
      if (/<html/i or /DOCTYPE/) {
        $resp = 1;
        last;
      }
    }
  }
  # Get headers
  @stuff = `curl -G -b $cookie_file -m 1 -A Mozilla/4.0 -s --head $url 2>&1`;
  if ($stuff[0] =~ /curl.* Couldn't resolve/) {
    # No such site
    return 0;
  }
  if ($stuff[0] =~ / 200 OK/) {
    myprint("$url\n") if $debug;
    if ($deep) {
      myprint("\n\nPossibly replace $link with \n $url \n");
    }
    return 1;
  }
  if ($stuff[0] =~ / 301 /) {
    # Moved ..
    myprint("$url: $stuff[0]\n") if $debug;
    $moved = 1;
  }
  foreach (@stuff) {
    chomp;
    if ($moved and /Location: / and not $deep) {
      $deep++;
      s/Location: //;
      print("Recheck $_\n") if $debug;
      return (check($_));
    }
  }
  # Some other return, needs manual checking
  return 0;
}
sub myprint {
  my $string = shift;
  print $string;
  print LINKS $string;
  # Also log whatever curl returned for this link
  foreach (@stuff) {
    print LINKS $_;
  }
}
 

I think that's about as close as I'm going to get, for various reasons.

First off, you'll notice I used curl. Actually, I originally wrote this using LWP, but it was easier to debug problems using curl at the command line, so I went that way instead. Now that I've figured out what I need to do, I could rewrite it for LWP, but really, why bother? The speed issue here is the internet, not the cost of firing off "curl".
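
For the record, the same sort of check is easy enough in LWP. This is just a minimal sketch, not part of the script above; the one-second timeout and agent string mirror the curl flags, and LWP's habit of quietly following redirects is part of why I stuck with curl:

#!/usr/bin/perl
# Sketch only: a HEAD check with LWP::UserAgent instead of shelling out to curl.
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    timeout => 1,              # matches curl's -m 1
    agent   => "Mozilla/4.0",  # same agent string the curl calls send
);
$ua->cookie_jar({ file => "/root/data/cookies/cookies.txt" });

sub check_head {
    my $url = shift;
    # LWP follows redirects for HEAD by default, so a 301 that lands on a
    # valid page just looks like success here - the curl version makes the
    # relocation visible instead.
    my $resp = $ua->head($url);
    return $resp->is_success;
}

print check_head("http://aplawrence.com/") ? "ok\n" : "may be dead\n";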

The other issue is that it needs to be slightly broken.

Yeah, that sounds funny, but consider this: sometimes a website will accept a link that is long gone and redirect you somewhere pointless - or at least pointless from my point of view. While the redirect might make sense for an old link in some cases, it may not in others, so I need to look at all those cases individually.

There are links that will fail for other reasons: they need a POST instead of a GET, for example. Again, I want to look at those manually before I accept them or let them pass.

Finally, I don't correct relative links of the form "xyz.html". I don't want to leave those as they are anyway, since they would break if I ever moved the page to another directory, so I let them fail the check even though they are usually valid.
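
As the code stands, prepending the site root turns "xyz.html" into "http://aplawrence.comxyz.html", which is never going to resolve. If I did want to fix those up properly, URI's new_abs would do it - a sketch only, with a made-up base page for illustration:

use URI;

# Resolve a bare relative link against the page it appeared on.
# The base URL here is just an example, not something the script tracks.
my $base = "http://aplawrence.com/Blog/B1100.html";
my $link = "xyz.html";
my $abs  = URI->new_abs($link, $base);
print "$abs\n";   # prints http://aplawrence.com/Blog/xyz.html

But since the whole point is to flag those pages so I fix the links themselves, failing the check is what I want here.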

I thought about adding some code that will make a record of "accepted" links so that I need not check them again the next time I run this, but as things at the website may have changed by then, that doesn't make sense. I may need to repeat my work next year, but that's the way it is.

As there is so much to check, I don't even try to run through the whole site. I run this when I have some other reason to look at a page: somebody left a comment, or it came up when I was searching for something. Other than that, broken links could sit here for many a year!


