APLawrence - Information and Resources for Unix and Linux Systems, Bloggers and the self-employed















Scraping sites - original source tag?



I'm somewhat disturbed but still ambivalent about the large number of scraper sites - sites with little or no original content that just reprint articles taken from other sites. There can be value to such sites in the sense that they consolidate information in specific subject areas, but what disturbs me is that often the original source isn't immediately apparent.

For example, take a look at The Linux World Learns How Larry Ellison Does Business. Until you click on the "Continue" link, it isn't at all obvious that this content is actually taken from another site. There's no indication that this is a quote - in fact, it definitely isn't a quote. I'm not convinced this extract would qualify as "fair use" either: the copyright rules for fair use seem to imply that there has to be more than just the other person's content:

... the amount and substantiality of the portion used in relation to the copyrighted work as a whole

Now of course this depends upon the license offered by the owner of the content. People often take things from here, and in most cases I'm fine with that as long as full and proper attribution back here is given with the post. Incidentally, if one of my articles were taken and presented like that Larry Ellison example above, I'd object: that wouldn't meet my terms of use.

More recently I changed my Copyright Policy. Please read that before taking any content.

But never mind that. Let's say a site does crib my content along with others and fairly presents it with no attempt at obfuscating the original source. Does that have value? Maybe, but this is where I get really wishy-washy. In theory, such an accumulation of knowledge in a particular subject area could be of value to someone researching that area. That's particularly true when Google fails to deliver good results, either because the search terms have too many alternate meanings or because there is too much "garbage" to sift through: a human editor can do far better than Google in those circumstances.












But that sort of thing could be done just as well with links. Why is it necessary to duplicate the actual content? The answer is plain: it's necessary because that's the only way search engines will see the site as authoritative. So, as much as we (the original authors) may dislike it, we probably aren't going to stop it unless we prohibit re-use of our material outright.

And that is something I do not want to do, on both moral and practical grounds. Morally, I prefer to share, and practically I can't prevent it and actually do benefit from it (for example, I get a lot of traffic from WebProNews, a regular regurgitator of my content). I also recognize that even I use consolidation sites more than I use original source sites: it's just easier to find the things I am looking for (however, I do quote the original source if I quote anything at all).

It still annoys me greatly when something I wrote and first published here turns up in search engine results at some other site. Damn it, *I* wrote it, they didn't. The search engines should be sending the traffic to me, not to them.

I'd like to propose a solution: a simple tag system that search engines could recognize and use to attribute the original source. We content authors could make inclusion of that tag a condition for republication on the web, and search engines could cooperate by recognizing that tag and properly attributing the source. For example, it might be as simple as an href with specific language:

Original source http://aplawrence.com/Web/web_scrapers.html - this hyperlink must be included to republish this article on the Web.

If search engines understood the meaning of that, and would redirect subsequent traffic to the real source, I'd be a lot happier. Understand that I'm not talking about using this for "fair use" quoting, and also that I'd expect search engines to show both sources. For example, if I wrote an article about widgets that was picked up by the Widgets Today site, a search engine that had indexed that for certain terms could display both the Widgets Today link and an "Original Source" link back here. That would give full and proper credit and also give searchers a choice as to what they wished to read.
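To make the idea a little more concrete, here is a rough sketch in Python of what a cooperating search engine might do. Nothing here is an existing standard or any real search engine's behavior: the marker phrase "Original source", the function names find_original_source and search_result, and the widgetstoday.example address are all assumptions made up for illustration. The sketch also assumes the required wording is the visible text of the hyperlink itself.

```python
from html.parser import HTMLParser

# Assumed marker phrase. The proposal only says "specific language",
# so the exact wording here is hypothetical.
MARKER = "original source"


class OriginalSourceFinder(HTMLParser):
    """Collects the href of every link whose visible text starts with MARKER."""

    def __init__(self):
        super().__init__()
        self.sources = []   # hrefs recognised as attribution links
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace and ignore case before comparing.
            label = " ".join("".join(self._text).split()).lower()
            if label.startswith(MARKER):
                self.sources.append(self._href)
            self._href = None


def find_original_source(html):
    """Return the first attributed source URL in the page, or None."""
    finder = OriginalSourceFinder()
    finder.feed(html)
    finder.close()
    return finder.sources[0] if finder.sources else None


def search_result(page_url, html):
    """Return the links a cooperating engine might show for one indexed page."""
    links = [("Result", page_url)]
    origin = find_original_source(html)
    if origin and origin != page_url:
        links.append(("Original Source", origin))
    return links


if __name__ == "__main__":
    # A made-up republished page carrying the attribution link.
    page = (
        "<p>All about widgets.</p>"
        '<a href="http://aplawrence.com/Web/web_scrapers.html">'
        "Original source - this hyperlink must be included to republish "
        "this article on the Web.</a>"
    )
    for label, url in search_result("http://widgetstoday.example/widgets", page):
        print(label, url)
```

A page that carries the link would be listed with both its own address and an "Original Source" entry, while a page with no such link, or the original page itself, would be listed as usual. A real engine would also need some way to check that the claimed source actually published the text first, which this sketch does not attempt.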

See also Does RSS Imply Permission To Reuse Content?

If you know of any efforts in this regard or have other ideas, I'd love to hear about them in the comments below - thanks!



7 comments




More Articles by Anthony Lawrence








Wed Nov 1 00:24:33 2006:   bruceg2004


Tony has a great idea proposed here. A tag that identifies where the content originates, so that the author is given the correct credit. Something like a book's ISBN code could be generated, so the content would be forever "cataloged".

- Bruce



Wed Nov 1 00:53:55 2006:   TonyLawrence

If you think the idea has value, help spread it: write about it, Digg it, whatever - maybe we can get enough interest to make it happen.



Wed Nov 1 15:19:59 2006:   bruceg2004


I did digg it right after I read it, and posted my comment yesterday.

http://digg.com/linux_unix/Scraping_sites_original_source_tag


Only three diggs so far...

- Bruce



Wed Nov 1 16:23:39 2006:   TonyLawrence

I see that disappointing response. I think it's either that I explained the concept poorly or just a general lack of interest: the content scrapers of course don't care, and at least some portion of content providers don't want to share at all, so this is meaningless to them - though many of them ARE getting scraped without their permission.

And of course the vast majority of Digg readers are neither producers nor scrapers, so while they should perhaps care on moral grounds, in practice it's just completely unimportant to them.







Wed Nov 1 17:43:32 2006:   TonyLawrence

Here's someone questioning the value of Digg etc. entirely:
http://www.37signals.com/svn/posts/93-its-the-content-not-the-icons



Sun Dec 9 11:42:49 2007:   TonyLawrence

I still think this is a good idea. Someone else recently noticed this: http://unixmouth.com/2007/11/13/scraping-sites-original-source-tag/



Sun Jan 6 23:01:23 2008:   TonyLawrence

Not quite the same thing, but related and definitely of interest (thanks, Bruce!)

http://numly.com/








