When I'm doing some project that requires storing and retrieving data, I usually have a mental argument about how to structure the files. Should I use flat text, or some binary format (typically Perl dbm files)?
The answer should almost always be "flat text", but old habits die hard. I started doing this stuff when computers were much slower in every respect, and were severely limited (by today's standards) in both disk and RAM. As hard as it may be to imagine, storing a year in two digits or four was an important decision - with a 100,000 record database storing just four date fields, you'd waste almost a megabyte of space if you went with four digits (100,000 records x 4 fields x 2 extra bytes = 800,000 bytes). When hard drives were 10 MB, that was a definite "think this over", because not only were you using a lot of expensive space, but the increased file size would definitely mean longer access times. Hard drives were painfully slow, so you'd want to bring things into RAM whenever possible, but 16 MB of RAM was a lot!
Those days are long gone. An extra megabyte of space is meaningless; an extra gigabyte is nearly meaningless. Speed has improved on all fronts; not as much for disk drives as for RAM and CPUs, but even disk drives have improved quite a bit. In fact, for any interactive use, it is extremely unlikely that your users would ever be able to observe any speed difference from using flat files - and that's probably true no matter how clumsily you program it!
As a test, I created a 45 MB text file that started each line with a randomly generated number. There were 65,536 lines. I wrote a horribly inefficient "search" program:
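    #!/usr/bin/perl
    # Deliberately naive search: rescan the whole file for every argument.
    use strict;
    use warnings;

    while (@ARGV) {
        my $x     = shift @ARGV;
        my $found = 0;
        my $start = time();    # time each search separately

        # Reopen the file for every search; no index, no sorting.
        open(my $fh, '<', 't.dat') or die "Cannot open t.dat: $!";
        while (<$fh>) {
            # Count lines that begin with the search key followed by a space.
            $found++ if /^\Q$x\E /;
        }
        close($fh);

        my $elapsed = time() - $start;
        print "Found $found matches in $elapsed seconds\n";
    }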
Notice that this program does nothing to help itself: it reopens the file for every search, there aren't any indexes built to help find data, the file isn't sorted - it's about as bad as you can get. Yet, on my underpowered iBook, it takes very little time to find matching records for any number:
    bash-2.05a$ ./t.pl 668453
    Found 69 matches in 2 seconds
Sure, you could do it much faster with a B-tree, but would it really matter? And if it did, there are numerous ways you could rewrite this to be much, much faster. A small startup delay could give you an in-memory index that would let you seek to any record very rapidly, and even a file of this size can probably be brought right into RAM on many machines. You could build external index files (which should also be flat text!) for extremely fast access.
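To make the in-memory index idea concrete, here is a minimal sketch. It assumes the same layout as the test file (a key, a space, then the rest of the record), and the names `build_index` and `lookup` are my own inventions for illustration, not anything from the original program. It reads the file once, remembers the byte offset of every line under its key, and then seeks directly to the matching lines.

```perl
#!/usr/bin/perl
# Sketch: index byte offsets in memory, then seek straight to matching lines.
use strict;
use warnings;

# Scan the file once, recording where each line starts, grouped by leading key.
sub build_index {
    my ($path) = @_;
    my %index;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    while (1) {
        my $offset = tell($fh);          # byte position where the next line starts
        my $line   = <$fh>;
        last unless defined $line;
        my ($key) = $line =~ /^(\S+) /;  # key is the text before the first space
        push @{ $index{$key} }, $offset if defined $key;
    }
    close($fh);
    return \%index;
}

# Fetch every record for a key by seeking to its recorded offsets.
sub lookup {
    my ($path, $index, $key) = @_;
    my @records;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    for my $offset (@{ $index->{$key} || [] }) {
        seek($fh, $offset, 0) or die "Cannot seek in $path: $!";
        my $line = <$fh>;
        chomp $line;
        push @records, $line;
    }
    close($fh);
    return @records;
}

# Usage: ./index.pl t.dat 668453 [more keys...]
my ($path, @keys) = @ARGV;
if (defined $path) {
    my $index = build_index($path);    # the one-time startup delay
    for my $key (@keys) {
        my @records = lookup($path, $index, $key);
        printf "Found %d matches for %s\n", scalar @records, $key;
    }
}
```

The same hash could just as easily be written out as "key offset" lines, which would give you an external index that is itself flat text.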
Yes, there are applications where you really need to use something faster. I wouldn't try to run an airline reservation system this way - though if the airline were small enough, you certainly could. But so many other applications can be done this way, and it is a shame that they are not. Why? Because flat files are easily accessible with other tools, both for reading and for modification. Use grep, vi, sort: whatever. Text is universal, too: you won't have problems transporting your data to some other platform if it's text. "987.34" is "987.34" regardless of big-endian or little-endian byte order, and regardless of whether a floating-point number is stored in 32 bits or 64.
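If the portability point seems abstract, a few lines of Perl show it. This is only an illustration using the built-in `pack` function; the integer 98734 is an arbitrary value I picked, and the byte counts for `f` and `d` are what you would see on typical platforms rather than a guarantee.

```perl
#!/usr/bin/perl
# Sketch: binary encodings depend on byte order and width; text does not.
use strict;
use warnings;

# The same 32-bit integer packed in two byte orders yields different bytes.
my $big    = pack('N', 98734);    # big-endian ("network") order
my $little = pack('V', 98734);    # little-endian ("VAX") order
printf "big-endian:    %s\n", unpack('H*', $big);
printf "little-endian: %s\n", unpack('H*', $little);

# The same value as a native single or double float has a different width.
printf "float:  %d bytes\n", length(pack('f', 987.34));
printf "double: %d bytes\n", length(pack('d', 987.34));

# As text, the value is the same six characters everywhere.
my $text = '987.34';
printf "text:   %s (%d bytes)\n", $text, length($text);
```

A program reading the binary forms has to know which of these encodings the writer used; a program reading the text form just reads it.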
There are tradeoffs, of course. Perl dbm files are very nice and easy to work with: your access is through a hash, and Perl handles all the nasty details. You just open the database, read and write keys in the hash, and close it. Yet every time I do this (the temptation overwhelms me too often), I later find myself writing stupid programs to do things I could do in an instant with ordinary Unix tools if I had just done it with text files from the beginning.
All right, I'm making an early New Year's resolution: no more non-text formats. It's text for me from now on. I also have every intention (which means it probably won't happen, but I have INTENTIONS) of rewriting the non-textual stuff I have now.
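For anyone who hasn't used dbm files from Perl, this is roughly what both halves of that tradeoff look like. It is a sketch only: I'm assuming `SDBM_File` (one of the dbm flavors bundled with Perl), and `open_db` and `dump_db` are hypothetical helper names. The first half is the easy part; the second is the kind of throwaway dump program you end up writing just so grep and sort can get at the data.

```perl
#!/usr/bin/perl
# Sketch: a dbm file tied to a hash, plus the dump-to-text helper it forces on you.
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT
use SDBM_File;    # assumption: any dbm module with a tie interface would do

# Open (or create) a dbm database and expose it as an ordinary hash.
sub open_db {
    my ($path) = @_;
    tie(my %db, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666)
        or die "Cannot tie $path: $!";
    return \%db;
}

# Write every key and value as a "key value" line to the given filehandle.
sub dump_db {
    my ($db, $out) = @_;
    for my $key (sort keys %$db) {
        print {$out} "$key $db->{$key}\n";
    }
}

# Usage: ./dbm.pl /path/to/database
my ($path) = @ARGV;
if (defined $path) {
    my $db = open_db($path);
    $db->{668453} = 'example record';    # made-up sample data
    dump_db($db, \*STDOUT);
    untie %$db;
}
```

With a flat file, the dump step is not needed at all; the data is already in the form the Unix tools expect.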