# Cleaning up a large web site for Google
APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed


I've removed advertising from most of this site and will eventually clean up the few pages where it remains.

While not terribly expensive to maintain, this does cost me something. If I don't get enough donations to cover that expense, I will be shutting the site down in early 2020.

If you found something useful today, please consider a small donation.



Some material is very old and may be incorrect today

© March 2011 Anthony Lawrence

I decided to be proactive and start cleaning up this very ancient site (I started here in 1997, and there are actually pages that came from an even earlier site). My reasoning is that while I may not have been deeply damaged by the Google Snarling Panda update, I have seen some ill effect and I suspect it will only get worse over time.

The problem is "low value content". That's tough to define absolutely, but some pages here definitely fall into that category no matter how loosely we might apply standards.

So, with some reservations, I decided to begin weeding the site. My reservations come from the potential historical value of some pages - I really don't like to take anything out. On the other hand, if that content has the potential to damage the perceived value of other content, it has to go.

Or does it? With regard to search engines, it's possible to add a meta tag that tells search engines not to index a particular page. That looks like this:



<meta name="robots" content="noindex">
 

You can also exclude individual files and even whole directories with a "robots.txt" file. These methods might stop search engines from forming a bad opinion of your site, but they won't change a human's opinion at all.
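For completeness, a robots.txt that does that kind of exclusion might look like this (the paths here are hypothetical, just to show the shape of the file):

```
User-agent: *
Disallow: /Tests/
Disallow: /Unixart/some_old_page.html
```

The first Disallow blocks a whole directory, the second a single page.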

I therefore reluctantly decided to begin the clean up.

## Where to begin?

I decided that although I do need to look at everything, I should begin with the shortest articles. With many thousands of pages, it could take me months (perhaps years) to visit and assess every one, so starting with the short pages makes sense. Their content might be merged into other pages on a similar subject, or they might simply need deletion. In either case, I might want to add an entry to my .htaccess files to redirect the former page somewhere else.

The redirection idea did give me some cause for thought. It is certainly possible to redirect every deleted page, even if there really is no appropriate place to send it. I know some websites do that, but it seems both dishonest and annoying to me. I decided that a deleted page is a deleted page and to let the 404's fall where they may.

On the other hand, if a redirection is reasonable, I definitely would want to do that.
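Where a sensible target exists, a one-line entry in .htaccess does the job. A sketch using Apache's mod_alias (the paths are made up for illustration):

```apache
# Send a deleted short page to the article it was merged into
Redirect 301 /Unixart/old_short_page.html /Unixart/merged_article.html
```

Anything without a reasonable target is simply left to return a 404.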

## Sorting by size

So, the first task was to sort pages by size. You could do that with a simple "find" using "-exec wc -w {} \;", but I wrote a Perl script instead to avoid bothering to list pages above a certain size - I'll get to those eventually, but I wanted to work on the shortest pages first.
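The Perl script itself isn't shown here, but a rough shell equivalent of the same pass might look like this (filenames with spaces would confuse the awk step, which was not a concern for this site):

```shell
# List .html files below a word-count limit, shortest first.
# wc emits "count filename" lines plus a "total" summary, which
# awk filters out along with anything at or above the limit.
limit=300
find . -name '*.html' -exec wc -w {} + |
  awk -v max="$limit" '$2 != "total" && $1 < max' |
  sort -n
```

The output is the worklist: word count, then path, smallest pages at the top.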

The results were somewhat daunting. Although I initially set a fairly low cutoff (300 words), my list ran to over 6,000 pages. That's far too many to examine by hand, but I quickly realized that many of them were in directories I could remove entirely, or were pages I expected to be short anyway (like the pages in Tests). After removing the unneeded directories and adjusting the script to skip Tests, I ran it again and came up with a more manageable number: slightly over 2,000 pages.

That's still a pretty large number. Even if I could average one page a minute, that's more than 30 hours of work. As it turned out, one page per minute was a fair estimate: while I could dispose of some pages very quickly, others took much longer, and it took about a week to make a first pass at the list.

The resulting file consisted of lines like this:

Basics/readlink.html
Unixart/unixarticles.html
Unixart/gmail_domains.html
Unixart/simh.html
Unixart/watchdog-panic.html
Unixart/xtty.html
Unixart/novell_owns_it.html
Unixart/value_text.html
Unixart/pnmouse.html
 

## Some scripting help

I used my Mac OS X machine to assist in the processing. This little shell script I called "checkweb" handled everything I needed:

#!/bin/sh
# checkweb - gather everything needed to judge one page
file="$1"
echo "$file" | pbcopy                # copy the path to the clipboard
echo "$file"
open "http://aplawrence.com/$file"   # view the page in the default browser
grep "$file" Google_Links            # who links here (Webmaster Tools export)?
checklinks "$file"                   # what does this page link to?
grep "$file" my_htaccess             # already redirected to or from?

I ran that with

for i in `cat shortlist`
do
    checkweb "$i"
    read ak    # pause until I press Enter
done

The pbcopy is a Mac OS X utility that simply copies text to the clipboard, which is convenient for whatever action I might take, whether editing or deleting.

The "open" (another Mac OS X command) causes Chrome (my default browser) to open a new tab displaying the page. Back at the terminal, I also check whether anyone else links to this page by grepping a list provided by Google Webmaster Tools.

The "checklinks" is a modification of the link checker I described at Smarter HTML Link Extractor.

Finally, I check if it is already in a combined copy of my .htaccess files either as a source or a target, as that information could certainly affect my decisions.

## A good start, but much more to do

This script made the work relatively painless, and I was able to merge or outright delete many hundreds of pages in the first week. I was also able to spot page errors and make some quick corrections, but as I was anxious to get the bulk of the work done quickly, I left the less pressing errors and fixups for another day. There were also some pages where I just could not reach an instant decision about what to do with them; I added those to a "Later" file and quickly moved on to the next page on the list.

So, that's where we are now: a few thousand pages removed (yes, I still have copies, of course) and a few hundred others modified. I'll keep working at it until it's done or I lose my mind, whichever comes first.
