Text vs. Binary Data Formats

When I'm doing some project that requires storing and retrieving data, I usually have a mental argument about how to structure the files. Should I use flat text, or some binary format (typically Perl dbm files)?

The answer almost always should be "flat text", but old habits die hard. I started doing this stuff when computers were much slower in every respect, and were severely limited (by today's standards) in both disk and RAM. As hard as it may be to imagine, storing a year in two digits or four was an important decision: with a 100,000 record database storing just 4 date fields, you'd waste almost a megabyte of space if you went with four digits (100,000 records x 4 fields x 2 extra digits is 800,000 bytes, at one byte per digit). When hard drives were 10 MB, that was a definite "think this over" because not only were you using a lot of expensive space, but the increased file size would definitely mean longer access times. Hard drives were painfully slow, so you'd want to bring things into RAM whenever possible, but 16 MB of RAM was a lot!

Those days are long gone. An extra megabyte of space is meaningless; an extra gigabyte is nearly meaningless. Speed has improved on all fronts; not as much on disk drives as in RAM and CPU speeds, but even disk drives have improved quite a bit. In fact, for any interactive use, it is extremely unlikely that your users would ever be able to observe any speed difference from using flat files, and that's probably true no matter how clumsily you program it!

As a test, I created a 45 MB text file that started each line with a randomly generated number. There were 65,536 lines. I wrote a horribly inefficient "search" program:

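#!/usr/bin/perl
use strict;
use warnings;

my $start = time();
while (@ARGV) {
    my $found = 0;
    my $x = shift @ARGV;
    # Deliberately naive: reopen and rescan the whole file for every search term.
    open(my $fh, '<', 't.dat') or die "Cannot open t.dat: $!";
    while (<$fh>) {
        # \Q...\E keeps the search term from being treated as a regex.
        $found++ if /^\Q$x\E /;
    }
    close($fh);
    # Elapsed time is measured from program start, so it is cumulative.
    my $elapsed = time() - $start;
    print "Found $found matches in $elapsed seconds\n";
}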
 

Notice that this program does nothing to help itself: it reopens the file for every search, there aren't any indexes built to help find data, the file isn't sorted - it's about as bad as you can get. Yet, on my underpowered iBook, it takes very little time to find matching records for any number:

bash-2.05a$ ./t.pl  668453        
Found 69 matches in 2 seconds 
 

Sure, you could do it much faster with a B-Tree, but would it really matter? And if it did, there are numerous ways you could rewrite this to be much, much faster. A small startup delay could give you an in-memory index that would let you seek to any record very rapidly, and even a file of this size can probably be brought right into RAM on many machines. You could build external index files (which should also be flat text!) for extremely fast access.
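Here is a rough sketch of the in-memory index idea, not a tuned implementation. It assumes the same t.dat layout as the test above (each line starts with a number followed by a space); build_index and lookup are names made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scan the file once and remember the byte offset of every line, keyed by
# the leading number. Returns a hash ref: key => [offset, offset, ...].
sub build_index {
    my ($path) = @_;
    my %index;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my $offset = tell($fh);
    while (my $line = <$fh>) {
        my ($key) = split ' ', $line, 2;
        push @{ $index{$key} }, $offset if defined $key;
        $offset = tell($fh);
    }
    close($fh);
    return \%index;
}

# Seek straight to each remembered offset and read just that one line.
sub lookup {
    my ($path, $index, $key) = @_;
    my @lines;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    for my $offset (@{ $index->{$key} || [] }) {
        seek($fh, $offset, 0) or die "Cannot seek in $path: $!";
        my $line = <$fh>;
        push @lines, $line if defined $line;
    }
    close($fh);
    return @lines;
}

# Usage: ./indexed.pl 668453 (searches t.dat for each number given)
if (@ARGV) {
    my $index = build_index('t.dat');
    for my $key (@ARGV) {
        my @hits = lookup('t.dat', $index, $key);
        print @hits;
    }
}
```

The startup cost is one pass over the file; after that each search is a hash lookup plus a seek per matching record. The same key and offset pairs could just as well be written out as a flat text index file.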

Yes, there are applications where you really need to use something faster. I wouldn't try to run an airline reservation system this way, though if the airline were small enough, you certainly could. But so many other applications can be done this way, and it is a shame that they are not. Why? Because the flat files are easily accessible with other tools, both for reading and for modification. Use grep, vi, sort: whatever. Text is universal, too: you won't have problems transporting your data to some other platform if it's text: "987.34" is "987.34" regardless of big-endian or little-endian, regardless of a floating point number being stored in 32 bits or 64.

There are tradeoffs, of course. Perl dbm files are very nice and easy to work with: your access is through a hash, and Perl handles all the nasty details; you just open the database, read and write keys in the hash, and close it. Yet every time I do this (the temptation overwhelms me too often), I later find myself writing stupid programs to do things I could do in an instant with ordinary Unix tools if I had just done it with text files from the beginning.
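To make the tradeoff concrete, here is a minimal sketch of both halves: storing data through a tied hash, and the sort of export helper you end up writing afterwards. It assumes SDBM_File, the dbm flavor bundled with Perl; the function names and the sample keys are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;    # dbm implementation bundled with Perl

# Store key/value pairs through a tied hash; Perl handles the file format.
sub store_pairs {
    my ($dbname, %pairs) = @_;
    my %db;
    tie(%db, 'SDBM_File', $dbname, O_RDWR | O_CREAT, 0666)
        or die "Cannot tie $dbname: $!";
    $db{$_} = $pairs{$_} for keys %pairs;
    untie %db;
    return;
}

# The kind of helper that becomes necessary later: dump the dbm file as
# tab-separated text so that grep, sort and friends can work on it.
sub dump_as_text {
    my ($dbname, $out) = @_;
    my %db;
    tie(%db, 'SDBM_File', $dbname, O_RDONLY, 0666)
        or die "Cannot tie $dbname: $!";
    print {$out} "$_\t$db{$_}\n" for sort keys %db;
    untie %db;
    return;
}

# Usage: ./dbmdemo.pl prices ("prices" is a hypothetical database name)
if (@ARGV) {
    my $dbname = shift @ARGV;
    store_pairs($dbname, widget => '987.34', gadget => '12.50');
    dump_as_text($dbname, \*STDOUT);
}
```

With a text file, the second function would not need to exist at all.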

All right, I'm making an early New Year's resolution: no more non-text formats. It's text for me from now on. I also have every intention (which means it probably won't happen, but I have INTENTIONS) of rewriting the non-textual stuff I have now.




5 comments




© Tony Lawrence



I tend to agree, in that flat files, along with judicious use of common UNIX tools, can make for a decent database system.

However, where one would probably get into trouble with this approach would be when two or more users wanted to simultaneously update the database in some fashion. UNIX will not prevent write-conflicts to the same file, which means some kind of locking system would have to be devised to assure user A's changes weren't trashed when user B wrote out his.

The other consideration would be in retrieving an individual record. A technique like a binary search would not be practical, as text file tools aren't equipped to perform random access on flat files. A sequential search would be required, which could become very slow, especially if the desired record is near the end of the file.

--BigDumbDinosaur


I'm not saying to program the database access using Unix shell tools: Perl, C, whatever can provide the locking mechanisms.
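As one possible illustration in Perl, here is a minimal sketch using flock. The lock is advisory, so this only helps if every program that writes the file takes the same lock; append_record is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);    # LOCK_EX

# Append one record while holding an exclusive advisory lock, so two
# cooperating writers cannot interleave their updates.
sub append_record {
    my ($path, $record) = @_;
    open(my $fh, '>>', $path) or die "Cannot open $path: $!";
    flock($fh, LOCK_EX)       or die "Cannot lock $path: $!";
    print {$fh} "$record\n"   or die "Cannot write $path: $!";
    # Closing the handle flushes the write and releases the lock.
    close($fh)                or die "Cannot close $path: $!";
    return;
}

# Usage: ./append.pl FILE "record text"
if (@ARGV == 2) {
    append_record(@ARGV);
}
```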

As to the search techniques, nothing stops you from doing a binary search on a text file - it simply needs to be sorted.
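A rough sketch of how that might look: seek to the middle of the file, throw away the partial line you landed in, and compare the next full line. It assumes every line begins with a numeric key followed by a space, that the file is sorted numerically on that key (for example with sort -n), and that there are no empty lines. The function names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the numeric key at the start of a line.
sub leading_key {
    my ($key) = split ' ', $_[0], 2;
    return $key;
}

# Seek to a byte position and return the first complete line after it
# (or the first line of the file when the position is 0).
sub line_after {
    my ($fh, $pos) = @_;
    seek($fh, $pos, 0) or die "Cannot seek: $!";
    if ($pos > 0) {
        my $partial = <$fh>;    # discard the rest of the line we landed in
    }
    return scalar <$fh>;        # undef at end of file
}

# Binary search over byte offsets for the first line whose key is >= the
# target, then collect every line that matches the target exactly.
sub find_records {
    my ($path, $target) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my ($lo, $hi) = (0, -s $fh);
    while ($lo < $hi) {
        my $mid  = int(($lo + $hi) / 2);
        my $line = line_after($fh, $mid);
        if (!defined $line || leading_key($line) >= $target) {
            $hi = $mid;
        } else {
            $lo = $mid + 1;
        }
    }
    my @matches;
    my $line = line_after($fh, $lo);
    while (defined $line && leading_key($line) == $target) {
        push @matches, $line;
        $line = <$fh>;
    }
    close($fh);
    return @matches;
}

# Usage: ./bsearch.pl sorted.dat 668453 (file name is hypothetical)
if (@ARGV == 2) {
    print for find_records(@ARGV);
}
```

Each probe reads at most two lines, so the number of reads grows with the logarithm of the file size rather than with the number of records.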

Finally, as the example demonstrates, even brute force searches just are not all that slow.

--TonyLawrence


The last paragraph from "Dinosaur" is exactly why many programmers make bad decisions about data storage formats: they do not consider all factors.

I agree with Tony, there is no reason why you cannot access a flat file with a binary search. The only exception would be if it were on a strictly sequential medium, such as a magnetic tape file. Even that problem can be overcome by many utilities, which would read the tape file quickly to a temporary disk file, which would then be accessible by a binary search.

With the increases in cheap storage space and access speeds, there is RARELY sufficient benefit from storing in binary to justify the long-term costs of additional time required during maintenance and support activities. Remember, 90% of development costs are actually in support of the developed application.

Bonnie Phillips, Denver.


---October 20, 2004

It's funny that you chose floating point numbers as an example for portability. Floating point numbers cannot be stored in decimal format without conversion losses. If you write 9.43 in a file, you might get 9.429999965 or something like this after processing it once.

What can you do about this? Should you store it as decimal in memory? Keep the text representation along with the binary in memory? Allow small errors if checking for equality? Round before writing it again? Forbid calculations on floats? Ignore it? You will get into a hell of subtle errors because of this problem.
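Two of those options, rounding before writing and allowing small errors when comparing, might be sketched in Perl as follows. This is only an illustration: the two decimal places and the 1e-9 tolerance are assumptions about the data, and to_text and nearly_equal are made-up names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Option "round before writing": emit a fixed number of decimal places
# so the text form does not drift from one rewrite to the next.
sub to_text {
    my ($value, $places) = @_;
    return sprintf('%.*f', $places, $value);
}

# Option "allow small errors": compare with a relative tolerance.
sub nearly_equal {
    my ($x, $y, $tolerance) = @_;
    my $scale = abs($x) > abs($y) ? abs($x) : abs($y);
    return abs($x - $y) <= $tolerance * ($scale || 1);
}

my $value = 0 + '9.43';            # parsed from text into a binary float
print to_text($value, 2), "\n";    # written back as 9.43

my $sum = 0.1 + 0.2;               # 0.3 has no exact binary representation
print "close enough\n" if nearly_equal($sum, 0.3, 1e-9);
```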

Additionally, float parsing *is* tedious and slow. I recently moved a text format for storing geographical data to binary. File size dropped by 50% and loading time improved from ten seconds to one on my own underpowered iBook. It was clearly noticeable. That was a definitive indication to use binary.

---Shadow


---December 29, 2004

The fact that there are still cases where going for binary may be a better choice does not void the basic argument that high-level, i.e. textual, representations should be used wherever possible. On the other hand we all know that many real-world data objects are binary by their very nature, like image files, digital audio/video, and many more. Yet, the meta-data describing such objects can be more conveniently stored in text files (and in fact this is almost always the case). I believe so much in using text whenever possible that I have developed a small text-based database system that can be freely downloaded from http://www.scriptaworks.com/cgi-bin/wiki.sh/NoSQL/HomePage . It is modelled after other people's ideas so I'm not begging for credits here :-)

--Carlo


