Smarter HTML Link Extractor

It is an unfortunate fact that links go bad. That is annoying for your visitors and can also cause Google and other search engines to devalue your pages.

Checking links isn't really hard; you can actually do it with just a few lines of Perl.



#!/usr/bin/perl
# Quick and dirty: pull every link out of one HTML file and try to fetch it
use LWP::Simple;
use File::Basename;
require HTML::LinkExtor;

$checkfile=shift @ARGV;

# The base URL makes relative links (href="foo.html") absolute
$p = HTML::LinkExtor->new(undef, "http://aplawrence.com");
$p->parse_file($checkfile);
@links=$p->links;

foreach (@links) {
	# each link is an array ref: [tag, attribute => url, ...]
	$link=$$_[2];
	chomp $link;
	# LWP::Simple's get() returns undef on failure
	get($link) or print "bad $link in $checkfile\n";
}
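
You would run that against one file at a time, something like this (the script name is just whatever you saved the code as, and the file name is only an example):

perl quickcheck.pl somepage.html

LWP::Simple's get() returns undef when a fetch fails, which is what makes the "or print" work. If you'd rather not download every page just to see whether it still exists, LWP::Simple's head() returns an empty list on failure and can be tested the same way.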
 

The only problem with that is that it might be silly to go out to the web if you are sitting on the server where the html files live. It's not so horrible if you are checking a handful of files, but that script could take a long, long time to check a big site.

A quick web search turns up a nice Perl-based link checker that can check local files or go out to the web. You can find Unix, Linux and Windows versions at Linklint - fast html link checker. This is nice stuff - it's fast, flexible and it works. There's another one at LinkChecker.

Of course I can't use those.

Well, maybe I could, but I'd have to figure the thing out, read the code, and probably modify it. It's easier to just write my own: I get what I want, if I need to change it I understand the code, and it doesn't have to do anything I don't need. To me, that's better, but if you just need something now, I really would recommend taking a look at LinkLint.

My approach is a bit different. First, since I intend to run this on the webserver where the files live, I'll check the files directly where possible. Because I'm doing that, I'll also have to check .htaccess files for redirects.

If it's not a local reference, I'll use LWP to try to fetch the file. In either case, I'll record both failure and success so that I don't have to check more than once, ever. All of this makes for speedier link checking, and speed is critical when you have many files to check.

Of course you'd clear out the "known good" file now and then, because links do go bad.

As written here, the script is designed to take two arguments - the name of the file to check and an optional "debug" argument. It could be written to recursively run through your whole website, of course. It is just as easy to do something like:

for i in `cat list`
do
checklinks $i
done
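
The "list" file is just the names of the HTML files you want checked, one per line. Something like this would build it, assuming you run it from the top of your web tree:

find . -name "*.html" -print > list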
 

Finally, it assumes that you are sitting at the root of your website and that you have at least an .htaccess file there and possibly in subdirectories. Another way to do this would be to skip looking in .htaccess and try an LWP get if we got that far.
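
For the .htaccess part, what the script looks for are ordinary Apache Redirect lines. The paths in this example are made up, but the shape is what matters: when a status word is present, the old location is the third whitespace-separated field, and that is the field the script checks.

Redirect permanent /unix/oldname.html http://aplawrence.com/Unix/newname.html
Redirect 301 /Blog/old-post.html http://aplawrence.com/Blog/new-post.html

A link that turns up as the old location in one of those lines isn't really broken; Apache will send visitors to the new location, so the script counts it as good.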

Some notes on the code: Normally you would call the LinkExtor constructor with the name of the domain you want prepended to relative links (href="foo.html") as I did in the first example. As this code will be looking at local files, you might think you could let it do that and just strip it off. Unfortunately, LinkExtor isn't smart enough to know that href="foo.html" is "/Unix/foo.html" if it is found in "/Unix/whatever.html". Both "foo.html" and "/foo.html" end up as "yourdomain.com/foo.html" if you pass "yourdomain.com" to the constructor.

Therefore, we have to keep track of those details in the code.
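
If you want to see that behavior for yourself, a few lines will do it. This is a throwaway sketch; the domain and the href values are just placeholders:

#!/usr/bin/perl
# Show how HTML::LinkExtor absolutizes links against a base URL
use strict;
use warnings;
use HTML::LinkExtor;

# Only the site root is given as the base, as in the first example above
my $p = HTML::LinkExtor->new(undef, "http://yourdomain.com/");
$p->parse('<a href="foo.html">relative</a> <a href="/foo.html">absolute</a>');
$p->eof;

foreach my $l ($p->links) {
    my ($tag, %attr) = @$l;
    # both print http://yourdomain.com/foo.html - the parser has no idea
    # that the page containing href="foo.html" might really live in /Unix/
    print "$attr{href}\n";
}

With that out of the way, here's the whole script: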

#!/usr/bin/perl
use LWP::Simple;
use File::Basename;
# usage: checklinks htmlfile <debug>
require HTML::LinkExtor;
$checkfile=shift @ARGV;
$debug=shift @ARGV;

# Change these as you wish
# Clean out as desired
$knownbad="data/knownbadlinks";
$knowngood="data/knowngoodlinks";
$badlist="data/badlinkslist"; 
# $badlist has the bad links and where found

open(GOOD,"$knowngood");
while (<GOOD>) {
 chomp;
 $goodlinks{$_}=1;
}

open(BAD,"$knownbad");
while (<BAD>) {
 chomp;
 $badlinks{$_}=1;
}
close BAD; close GOOD;

if (not -e $checkfile) {
  print "Can't find $checkfile\n";
  exit 1;
}

# Get the links
$p = HTML::LinkExtor->new();
$p->parse_file($checkfile);  
@links=$p->links;

# Are they good links?
foreach (@links) {
        $type=@$_[0];
        # I only want anchor links;
	# but you could add image "src" links too.
        next if $type ne "a";
	$link=@$_[2];

	# these links are on every page here, so skip them
        next if $link =~ m{twitter\.com/share};
        next if $link =~ m{quantcast\.com};
	
	# skip those already checked as good
        print "$link known as good\n" if ($debug and $goodlinks{$link});
        next if $goodlinks{$link};

	# and the already known bad 
        if ($badlinks{$link}) {
           print "$link in $checkfile known as bad\n" if $debug;
	   next;
        }

	# absolute link but not ours
	if (($link=~ /^http:/ or $link=~ /^ftp:/)  and $link !~ /aplawrence.com/) {
         $badlink=0;
         $goodlinks{$link}=1;
	 # assume good for now; it gets pulled off the good list later if the fetch fails
         get($link) or $badlink=1;
         if ($badlink) {
		$newbadlinks{$link}=1;
		print "$link from $checkfile not found\n";
         }
         next;
        }

	$link=~ s/.*aplawrence.com.//;
        $goodlinks{$link}=1;
	next if $link =~ /cgi-bin/;
	next if $link =~ /favicon.ico/;
	next if $link =~ /\.jpg/;
	next if $link =~ /^\/#/;
	next if $link =~ /^#/;
	
	print "Checking $link\n" if $debug;

        # strip off any trailing tracking stuff
	$link=~ s/\.html.*/.html/;
	next if -e "$link/index.html"; 
	next if -e $link; 
        # needed for relative links
	$thisdir=dirname($checkfile);
	next if -e "$thisdir/$link"; 
	next if -e "./$link"; 
	if ($link !~ /.html$/) {
          print "odd $link in $checkfile\n";
          next;
        }

	# now check .htaccess for a redirect of this link
	$yes = 0;
	$page=$link;
        next if not $page;
	$page2=basename($page);
	open(I,".htaccess") or die "No htaccess";
	while (<I>) {
	 @stuff=split /\s+/;
	 $yes=1 if $stuff[2] =~ /\Q$page\E/;
	}
	next if $yes;
	# and any .htaccess in the subdirectory the file we're checking lives in
	$yes=0;
	open(I,"$thisdir/.htaccess");
	while (<I>) {
	 @stuff=split /\s+/;
	 $yes=1 if $stuff[2] =~ /\Q$page\E/;
	 #$yes=2 if $stuff[2] =~ /\Q$page2\E/;
	}
        next if $yes;
	
	print "can't find $link from $checkfile\n" if $debug;
        $newbadlinks{$link}=1;
	}

# write out the results
open(GOOD,">$knowngood");
open(BAD,">$knownbad") or die "Cannot $! $knownbad";
open(BADLIST,">>$badlist") or die "Cannot $! $badlist";
foreach (keys %newbadlinks) {
   # cannot be good if in bad list
   delete $goodlinks{$_};
   print BAD "$_\n" if not $badlinks{$_};
   print BADLIST "$checkfile\t$_\n";
}

foreach (keys %badlinks) {
   print BAD "$_\n"
}


foreach (keys %goodlinks) {
  print GOOD "$_\n";
}

close GOOD;
close BAD;
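
To check one page with debugging turned on, run it from the top of the web tree, something like this (the file name is only an example):

./checklinks Unix/somepage.html debug

The bad links, along with the file each was found in, accumulate in data/badlinkslist, so that's the file to read when you sit down to fix things.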
 

Since I wrote this, W3C has a link checker: Link Checker. It's slow, but very thorough. I can ask it to check the Printer Friendly version linked at the bottom of each page; that goes very quickly.



Thu Mar 17 01:23:19 2011: 9384   MikeHostetler



You should put this code on github -- makes it easy for other to grab and modify.





Thu Mar 17 02:15:14 2011: 9385   TonyLawrence



No, this is just introductory stuff to show people how to get started. It's never good code and isn't even intended to be.







Wed Jul 20 16:19:40 2011: 9647   anonymous




lynx --dump ' (link) |perl -ne 'print if s/^\s*\d*\. ([http|ftp].*)/$1/' |sort -u

;-)



Wed Jul 20 16:31:04 2011: 9648   TonyLawrence



You didn't read anything more than the headline, did you?
