If you want to process web pages automatically to extract data, you have a number of tools available. You can bring a web page down to your computer using "curl" or "wget":
curl http://aplawrence.com > mysite
If you don't really want the HTML, use "lynx --dump http://whatever.com > /yourstorage/whatever.txt" to get a text representation of the page. Check the man page for options you might want, like "--nolist", and also see lynx alternatives.
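Once a page is saved locally, ordinary shell tools can pull out just the pieces you need. The following is a minimal sketch, not a robust HTML parser: `extract_links` is a hypothetical helper name, the sample file merely stands in for a page saved with curl, and it assumes a grep that supports `-o` (GNU and BSD grep do).

```shell
#!/bin/sh
# Hypothetical helper: print the href targets found in a saved HTML file.
# Regex-based sketch only: it handles double-quoted href attributes and
# will miss single-quoted or unquoted ones.
extract_links() {
    grep -o 'href="[^"]*"' "$1" | sed 's/^href="//; s/"$//'
}

# Stand-in for a page saved earlier, e.g. with:
#   curl http://aplawrence.com > mysite
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><body>
<a href="/Words/2005_03_05.html">Forms example</a>
<a href="/Books/webc.html">Book review</a>
</body></html>
EOF

# Print one link target per line
extract_links "$sample"
```

For anything beyond simple, regular markup, a real HTML parser is the safer choice; pattern matching like this breaks easily when the page layout changes.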
You can also easily be selective and pull only the data you want from a page with simple Perl scripts.
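#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
# Fetch the page; get() returns undef on failure
my $url = 'http://aplawrence.com';
my $content = get($url);
die "Could not fetch $url\n" unless defined $content;
print $content;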
And then of course you'd process the $content as desired. It's only a little more complex if you are dealing with forms; see /Words/2005_03_05.html for a small example of that.
A book that covers LWP is reviewed at /Books/webc.html.