From: Tony Lawrence <tony@aplawrence.com>
Subject: Re: Problems with backups/restores
References: <9rdoqi$ui0$00$1@news.t-online.com> <3bdec471$1_7@news.newsgroups.com>
Date: Wed, 31 Oct 2001 10:45:12 GMT
Joe DeBiso wrote:
>
> There is a concept called "Sparse Files". This is where a file has data in
> front, an index at the end with "air" in the middle. Most backup software
> will null fill the "air" making the restored file bigger.
Wow, that's the worst explanation of sparse files I've ever seen.
Sparse files were actually a space saving "trick" introduced in Unix
filesystems a long, long time ago. The need sprang from hashed files,
which is probably where "index" got into your muddled explanation.
A hash is an access/storage method where a mathematical function is
applied to a key. The number that results from that function is used as
the record number, or offset of the file. For example, suppose we had
the keys mary and tom (with associated data, of course). Our hash
function turns the word "mary" into the number 45, and turns "tom" into
128. Pretending that the data stored is 512 bytes for each record,
you'd find mary's data 512 * 45 bytes from the start of the file, and
tom's at 512 * 128 bytes. This sort of "indexing" with hashed keys
gives incredibly fast access to records (there are issues with how to
deal with keys that hash to the same value, but we'll ignore that
here).
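
To make the key-to-offset idea concrete, here is a minimal sketch. The hash function (CRC32 reduced modulo an arbitrary slot count) and the names NUM_SLOTS, record_number and record_offset are stand-ins chosen for illustration; they are not the function that produced the 45 and 128 above, and collisions are ignored just as they are in the text.

```python
import zlib

RECORD_SIZE = 512      # bytes per record, as in the example above
NUM_SLOTS = 1_000_000  # assumed size of the hash space (hypothetical)


def record_number(key: str) -> int:
    """Map a key to a record number with a stand-in hash function."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SLOTS


def record_offset(key: str) -> int:
    """Byte offset of the key's record from the start of the file."""
    return record_number(key) * RECORD_SIZE


for key in ("mary", "tom"):
    print(key, record_number(key), record_offset(key))
```

Looking up a record is then a single seek to record_offset(key) followed by a 512-byte read, which is where the speed comes from.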
A good hash function is going to generate widely disparate numbers
(that's one of the ways to minimize the duplication problem). So rather
than 45 and 128, we'd really get something like 2 and 438,785. Now
suppose that these were the only data stored in the file so far: it
would be a pretty big file, over 200 megabytes (438,785 * 512), but
there are really only 1024 bytes of real data in it: a whole bunch of
wasted space.
Now we turn to the way Unix file systems work. Without getting into too
much detail, and without getting too much into the confusion of
indirect, double indirect etc here, the Unix inode has pointers to the
places where a file's data can be found. The first ten pointers point
directly to data blocks, the next points to an indirect block, which in
turn points to real data blocks, and so on.
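
As a rough illustration of that pointer scheme, the sketch below works out which kind of pointer would cover a given byte offset. The ten direct pointers and the 1k block size come from this post; the 4-byte block address (and so 256 pointers per indirect block) is an assumption, and real filesystems differ in these details.

```python
BLOCK_SIZE = 1024          # 1k blocks, as assumed in this post
DIRECT_POINTERS = 10       # "the first ten pointers"
POINTER_SIZE = 4           # assumption: 4-byte block addresses
PTRS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE  # 256 with the sizes above


def pointer_level(offset: int) -> str:
    """Return which kind of inode pointer covers the given byte offset."""
    block = offset // BLOCK_SIZE
    if block < DIRECT_POINTERS:
        return "direct"
    block -= DIRECT_POINTERS
    if block < PTRS_PER_BLOCK:
        return "single indirect"
    block -= PTRS_PER_BLOCK
    if block < PTRS_PER_BLOCK ** 2:
        return "double indirect"
    block -= PTRS_PER_BLOCK ** 2
    if block < PTRS_PER_BLOCK ** 3:
        return "triple indirect"
    raise ValueError("offset is beyond what this inode layout can address")


print(pointer_level(2 * 512))        # direct
print(pointer_level(45 * 512))       # single indirect
print(pointer_level(438_785 * 512))  # triple indirect with these sizes
```

With these assumed sizes the far record lands past the double indirect range; with 4k blocks the same offset would sit under the double indirect pointer, so the exact level depends on the filesystem.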
So, the "mary" data ends up in the second data block (assuming 1k blocks
here) and the "tom" data ends up way out in one of the double (or even
triple, depending on block and pointer sizes) indirect blocks somewhere.
None of the other pointers are being used. No data
needs to be stored, so no need to waste space: this is a sparse file.
If you look at it with "ls -l" it looks like it's 200+ MB, but if you
removed it, you wouldn't gain 200 MB of space. If you do something that
reads it sequentially, the driver just returns nulls for the data that
isn't there. And that's the problem: ordinary tape utilities write
those nulls, using up 200+ MB of tape, and if it is restored with the
same non-aware utility, the data blocks actually get allocated and
filled with ASCII NULs (zero bytes): now you really have used 200 MB of space.
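
The difference between apparent size and allocated space is easy to see for yourself. This is a small sketch, assuming a Unix-like system whose filesystem supports holes and which reports st_blocks in 512-byte units (common, but not guaranteed); the file name and variable names are made up for the demonstration.

```python
import os
import tempfile

RECORD_SIZE = 512
FAR_RECORD = 438_785  # the widely separated record number used above

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.dat")

    # Write two records and nothing else; the gap between them is a hole.
    with open(path, "wb") as f:
        f.seek(2 * RECORD_SIZE)
        f.write(b"m" * RECORD_SIZE)
        f.seek(FAR_RECORD * RECORD_SIZE)
        f.write(b"t" * RECORD_SIZE)

    st = os.stat(path)
    apparent = st.st_size  # what "ls -l" reports
    # st_blocks is not available everywhere; fall back to 0 if it is missing.
    allocated = getattr(st, "st_blocks", 0) * 512

    # Reading from the middle of the hole returns zero bytes.
    with open(path, "rb") as f:
        f.seek(100 * RECORD_SIZE)
        hole = f.read(RECORD_SIZE)

print("apparent size :", apparent)
print("allocated     :", allocated)
print("hole is zeros :", hole == bytes(RECORD_SIZE))
```

On a filesystem that supports sparse files the allocated figure should be a tiny fraction of the apparent size; on one that does not, the two will be about the same.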
The Supertars are smarter than this and do not write or restore the
nulls.
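
For the curious, one common way a copy or restore can avoid filling in the holes is to seek past all-zero blocks instead of writing them. The sketch below shows that general technique only; it is not a description of how any particular Supertar works internally, and the function name sparse_copy and the 1k block size are illustrative choices.

```python
import os

BLOCK_SIZE = 1024  # illustrative; matches the 1k blocks assumed above


def sparse_copy(src_path: str, dst_path: str, block_size: int = BLOCK_SIZE) -> None:
    """Copy a file, seeking over all-zero blocks instead of writing them."""
    zero_block = bytes(block_size)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)
            if not chunk:
                break
            if chunk == zero_block[:len(chunk)]:
                # Nothing but nulls: skip ahead and leave a hole.
                dst.seek(len(chunk), os.SEEK_CUR)
            else:
                dst.write(chunk)
        # Set the final length in case the source ends in a hole.
        dst.truncate(src.tell())
```

Note that this cannot tell a real hole from a block that genuinely contains zeros; the copy reads back the same either way, it just takes less space on a filesystem that supports holes.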
--
Tony Lawrence
SCO/Linux Support Tips, How-To's, Tests and more:
/Bofcusm/944.html copyright 1997-2004 (various authors) All Rights Reserved