Fixing 404 errors

Tue Jul 13 17:19:31 2004 Fixing 404 errors
Posted by Tony Lawrence
Search Keys: 404 ,error_log, custom 404, web/html
Referencing: /Unix/custom404.html

A 404 error is what you get when your browser tries to access a page that doesn't exist. Maybe you mistyped something, or the link you followed was mistyped by someone else, or maybe the webmaster moved it or renamed it or just deleted it. It's annoying for you, and sites that care about your visit try to avoid it happening.



Well, we can't stop 404's 100%, and frankly dealing with them is an annoyance for those of us maintaining the website too. It's bad enough that other sites cause us problems with incorrect links, but it is really annoying when we cause our own problems.

Unfortunately, tracking these things down and fixing them is a bit of a pain. The "Custom 404" page and associated script referred to above corrects a lot of common errors automatically, and tries to offer help when it can't just redirect you to the right page, but I need to keep updating it as I find new sources of errors. Sometimes the fix is as simple as just making a symbolic link, but if it is from an outside source, I want to correct it if I can. Even if it was caused by my own error, I may still want to add correction code in case that original error gets picked up by someone else.
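To make the idea concrete, here is a minimal sketch of what such a correcting 404 handler could look like. It is not the actual Custom 404 script: the lookup table and the two fix-up rules are hypothetical examples, and it assumes Apache is running the script as an ErrorDocument (Apache sets REDIRECT_URL to the originally requested path in that case).

```perl
#!/usr/bin/perl
# custom404-sketch.pl - hypothetical illustration, NOT the site's real script
use strict;
use warnings;

# Hypothetical table of known-bad paths and where they should go.
my %known_fix = (
    '/Unix/custom-404.html' => '/Unix/custom404.html',
);

# Return a corrected path, or undef if no rule applies.
sub correct_path {
    my ($path) = @_;
    return $known_fix{$path} if exists $known_fix{$path};
    (my $try = $path) =~ s{//+}{/}g;    # collapse doubled slashes
    $try =~ s{\.htm$}{.html};           # ".htm" typed for ".html"
    return $try ne $path ? $try : undef;
}

# Assumption: Apache passes the requested path in REDIRECT_URL.
my $bad  = $ENV{REDIRECT_URL} // '/';
my $good = correct_path($bad);
if (defined $good) {
    # Send the visitor on to the corrected page.
    print "Status: 301 Moved Permanently\r\nLocation: $good\r\n\r\n";
} else {
    # No fix known: return a normal 404 with a hint.
    print "Status: 404 Not Found\r\nContent-Type: text/html\r\n\r\n";
    print "<p>Sorry, that page is not here. Try the site search.</p>\n";
}
```

A real handler would also append each correction to a log so that a checking script can tell corrected 404's from uncorrected ones.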

So, to help me find errors, I have a Perl script that reads in the error_log, and compares it to a log of "corrections" already made by the Custom 404 script (this is necessary because the 404 ends up in my logs even though it was corrected). The script ignores pages that have already been corrected, and spits out a list of 404's I need to at least investigate. Many of these will be confused web spiders - it's really amazing how dumb some of these things are. For example, /MacOSX/macosxcupstofile.html contains this text:

sudo lpadmin -p tofile -E -v socket://localhost:12000 -m raw
 

Dumb spiders regularly think that is a link:

[Sun Jul 11 07:07:05 2004] [error] [client 217.107.152.79] File does not exist: /usr/local/www/vhosts/vps.pcunix.com/htdocs/MacOSX/socket://localhost:12000/
 
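Requests like that one are easy to recognize mechanically. As a rough sketch (the function name and the pattern are my own assumptions, not part of the script shown later), a path with a URL scheme such as "socket://" or "news:" embedded after a slash is almost certainly text that a spider mistook for a link:

```perl
use strict;
use warnings;

# True if a requested path contains an embedded scheme ("socket://",
# "news:", ...) after a slash, which a real page name here would not have.
sub looks_like_spider_junk {
    my ($path) = @_;
    return $path =~ m{/[a-z][a-z0-9+.-]*:}i ? 1 : 0;
}

for my $p ('/MacOSX/socket://localhost:12000/',
           '/SCOFAQ/news:comp.unix.admin',
           '/Books/creatingcoolwebsites.html') {
    printf "%-40s %s\n", $p,
        looks_like_spider_junk($p) ? 'spider?' : 'check';
}
```

This only flags candidates; a path can still deserve a look even when it matches.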

I have the script count the number of uncorrected 404 occurrences so that I can devote immediate effort to the more serious problems. The output of the script might look something like this:

/blog/b930.html 2
/SCOFAQ/news:comp.unix.admin 1
/cgi-bin/fmail.pl 1
/Books/creatingcoolwebsites.html 10
/e51/SCOFAQ/FAQ_scotec8xsession.html 1
 

Obviously I need to jump on that "creatingcoolwebsites.html" problem right away.
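With only five lines the worst offender is easy to spot, but with a few hundred it helps to sort. A minimal sketch, assuming the "path count" output format shown above (the function name is made up for the example):

```perl
use strict;
use warnings;

# Take "path count" lines and return them with the highest count first;
# ties are broken by path so the order is stable from run to run.
sub sort_by_count {
    my @rows = map { [ /^(.*)\s+(\d+)$/ ] } grep { /\s\d+$/ } @_;
    return map  { "$_->[0] $_->[1]" }
           sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] } @rows;
}

my @report = (
    '/blog/b930.html 2',
    '/SCOFAQ/news:comp.unix.admin 1',
    '/cgi-bin/fmail.pl 1',
    '/Books/creatingcoolwebsites.html 10',
    '/e51/SCOFAQ/FAQ_scotec8xsession.html 1',
);
print "$_\n" for sort_by_count(@report);
```

Lines that do not end in a number are silently dropped, which also filters out some of the garbage.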

See that "fmail.pl"? That's a script kiddie trying to break in:

205.158.224.234 - - [12/Jul/2004:12:22:04 +0000] "POST /cgi-bin/fmail.pl HTTP/1.0" 404 2317 "http://aplawrence.com/" "-"
 

Checking his other attempts proves it:

205.158.224.234 - - [12/Jul/2004:12:21:05 +0000] "POST /cgi-bin/formmail.pl HTTP/1.0" 404 2320 "http://aplawrence.com/" "-"
205.158.224.234 - - [12/Jul/2004:12:22:04 +0000] "POST /cgi-bin/fmail.pl HTTP/1.0" 404 2317 "http://aplawrence.com/" "-"
 

Nothing to worry about there.
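Pulling one client's requests out of the access log is a one-liner with grep, but here is a small Perl sketch of the same check. It assumes each entry sits on one line in Apache's combined format, as in the entries above; the function names are invented for the example.

```perl
use strict;
use warnings;

# Parse one combined-format line; returns (ip, method, path, status),
# or an empty list if the line does not look like an access_log entry.
sub parse_access_line {
    my ($line) = @_;
    return $line =~ m{^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) }
        ? ($1, $2, $3, $4) : ();
}

# Return "METHOD path status" for every request made by $ip.
sub requests_from {
    my ($ip, @lines) = @_;
    my @hits;
    for my $line (@lines) {
        my ($who, $method, $path, $status) = parse_access_line($line)
            or next;
        push @hits, "$method $path $status" if $who eq $ip;
    }
    return @hits;
}

my @log = (
    '205.158.224.234 - - [12/Jul/2004:12:21:05 +0000] "POST /cgi-bin/formmail.pl HTTP/1.0" 404 2320 "http://aplawrence.com/" "-"',
    '205.158.224.234 - - [12/Jul/2004:12:22:04 +0000] "POST /cgi-bin/fmail.pl HTTP/1.0" 404 2317 "http://aplawrence.com/" "-"',
);
print "$_\n" for requests_from('205.158.224.234', @log);
```

A client that does nothing but POST to a series of mail-script names that don't exist is probing, not browsing.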

The actual script is pretty simple:

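#!/usr/bin/perl
# ck404.pl - list uncorrected 404's from error_log with a count of each
use strict;
use warnings;

open(my $log, '<', "www/logs/error_log") or die "error_log: $!";
open(my $c, '<', "www/data/corrections") or die "corrections: $!";

# Paths the Custom 404 script has already corrected
my %corrected;
while (<$c>) {
  chomp;
  s/->.*//;     # drop everything from "->" on, keeping the bad path
  s/^\s+//;
  s/\s+$//;
  $corrected{$_} = 1;
}
close $c;

# Count each missing path that has not been corrected
my %count;
while (<$log>) {
  chomp;
  s/.*htdocs//;               # strip everything through the document root
  s/.*cgi-bin/\/cgi-bin/;     # likewise for CGI requests
  s/^\s+//;
  s/\s+$//;
  next if $corrected{$_};
  $count{$_}++;
}
close $log;

foreach (keys %count) {
  print "$_ $count{$_}\n";
}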
 

This does generate some extra garbage now and then; it doesn't need to be perfect - it's just a helper script that saves me time.
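If the garbage ever becomes a nuisance, one option is to look only at the "File does not exist" lines and ignore everything else in error_log. This is a sketch of that variation rather than a change to the script above; the function name is made up, it assumes each error is on a single line, and it would truncate a path that contains spaces.

```perl
use strict;
use warnings;

# Return the site-relative path from a "File does not exist" line of
# error_log, or undef for any other kind of line.
sub missing_path {
    my ($line) = @_;
    return undef unless $line =~ /File does not exist:\s*(\S+)/;
    my $file = $1;
    return $1 if $file =~ m{(/cgi-bin/.*)};    # CGI request
    return $1 if $file =~ m{/htdocs(/.*)};     # ordinary page
    return undef;                              # outside both trees
}

my $line = '[Sun Jul 11 07:07:05 2004] [error] [client 217.107.152.79] '
         . 'File does not exist: '
         . '/usr/local/www/vhosts/vps.pcunix.com/htdocs/MacOSX/socket://localhost:12000/';
my $path = missing_path($line);
print defined $path ? "$path\n" : "(not a missing-file line)\n";
```

In the main loop this would replace the substitutions: skip the line when missing_path returns undef, otherwise count the returned path.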

Well, I've got a few hundred 404's I need to go look at. Most of them will probably be spider errors, or things I can easily fix, but invariably there will be some new 404 mixup to deal with, and the Custom 404 code will grow some more.









---July 14, 2004

It seems to me that links in web pages are just another type of GOTO (http://www.acm.org/classics/oct95/) which is, as we all know, considered harmful. Maybe some day web pages will have proper control structures.

dhh

---July 14, 2004

You want looping web pages? :-)

--TonyLawrence




"It seems to me that links in web pages are just another type of GOTO..."

while (err == 404) {
printf("I'm dumb and I don't know it.\n");
}

There! That took care of the GOTO problem.

--BigDumbDinosaur

---July 14, 2004

I don't know who you meant to say is the dumb one, but here I'd say 50% of 404's are my fault, another 20% are some other website's fault (though some of those come from original sin on my part), 20% are dumb or confused spiders, and 5% are fat-fingered typists or people making their best guess at something they know is there (hint - that's what the Search is for!). The remaining 5% are break-in attempts - script kiddies, etc.

These percentages are derived from my imagination mostly and some perusal of logs.


--TonyLawrence




The "dumb" reference was to the computer, not you.

Given the amount of stuff you have on this site, I'm amazed that so few bad links actually exist. So an occasional 404 is hardly anything to fret over.

--BigDumbDinosaur

---July 15, 2004

I know you meant the computer :-)

But really - 50% or more are my own fault, so the real dumb one is me..

:-)

--TonyLawrence




-----

Actually, this isn't the place to post a request for help -- newsgroups and forums are where you should go. Also, if all else fails, RTFM! Whatever software you are using to compose and expose your pages will (or should) have the info you need. If it doesn't, perhaps you shouldn't be using it, eh?

However, to address your particular situation, you might wish to carefully examine how you set up your links. It could be that they point to locations relative to the filespace in which the pages were generated. The result would be that everything would work for you, but when pages are requested by a remote client, the links will not make any sense to that client.

Links are relative to the document base, which if confined entirely to the realm of a PC running Windows (I assume that's what you have, since you mentioned "explorer"), would be c:\somedir\somepage.html. That's not going to mean much to someone coming in over the Internet, eh? That's because the webserver confines its knowledge of the filesystem to a particular area. For example, on my webserver, my website (www.bcstechnology.net) is stored in /bcs00/port80.d/bcs.d (obviously not Windows). That is referred to as the document home. From the server's perspective, the document home is the root of the entire website. So any link you might select on my site will refer to /whatever.html, not /bcs00/port80.d/bcs.d/whatever.html.

BTW, you shouldn't be using Internet Exploder to test a web site. Just because it works with IE doesn't mean that the page structure conforms to W3C standards. You might be dismayed to discover that your site isn't accessible to those who are smart enough to stay away from Microsoft's super-lame browser.

--BigDumbDinosaur


---December 29, 2004

Right. http://aplawrence.com/forum.html is our forum here.

The most common error is just leaving off a slash:

  Unixart/ksh.html vs. /Unixart/ksh.html

The non-slash will work if called from /index.html but not from /Unixart/index.html

--TonyLawrence
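The missing-slash mistake is easy to see if you work out what the browser actually requests. A simplified sketch (the function name is invented, and it deliberately ignores "..", query strings and full URLs):

```perl
use strict;
use warnings;

# Resolve $href the way a browser would when it appears on page $page.
# A leading slash means "from the site root"; anything else is taken
# relative to the directory of the page that holds the link.
sub resolve_link {
    my ($page, $href) = @_;
    return $href if $href =~ m{^/};
    (my $dir = $page) =~ s{[^/]*$}{};    # directory part of the page
    return $dir . $href;
}

print resolve_link('/index.html',         'Unixart/ksh.html'),  "\n";
print resolve_link('/Unixart/index.html', 'Unixart/ksh.html'),  "\n";
print resolve_link('/Unixart/index.html', '/Unixart/ksh.html'), "\n";
```

This prints /Unixart/ksh.html, then /Unixart/Unixart/ksh.html (the 404), then /Unixart/ksh.html again: the link with the leading slash works from any page.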










