APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

A mistaken grep

A recent post at Unix.com came from someone having difficulty with "grep". This happened to be on Mac OS X, but it could have happened almost anywhere, even on Windows. The poster was trying to grep a string from a NeoOffice document and, of course, not getting great results.

Apparently he'd gotten wind of grep from a brief mention in Pogue's The Missing Manual. Admittedly, Pogue isn't very clear about things; he says something like "its search material can be part of any file, especially plain text files" (emphasis mine). I'm not here to beat up on Pogue for his poor comprehension of Unix utilities, but the "especially plain text files" might have been what prompted this person to follow up with this:

(From post at "proper syntax of grep command" - Unix.com)



I just copied the NeoOffice file (saved as .doc, a word format) to text edit (a .txt file) and the command worked (grep 'I am writing') but it printed the whole letter, or most of it -- which I'm guessing means that it found all the lines with any of the three words in the string and printed those lines. Is it possible to use 'grep' to find this particular sentence fragment and no other lines which don't contain this entire fragment?

You and I know that COPYING a file to a name with a .txt extension changes nothing. It's still a .doc file and always will be. If he or she wanted a text file, they needed to do a "Save As" from NeoOffice and choose a text format to save into. Then and only then is grep likely to return the expected results.
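Here is a rough sketch of both points in shell. The file names and the letter text are invented for the demonstration, and a plain-text stand-in plays the part of the .doc file, so treat it as an illustration rather than a reproduction of the poster's setup. It shows that a copy is byte-for-byte identical whatever you call it, and that a quoted pattern is matched as one phrase while grep still prints the entire line containing it. If a paragraph is stored as one long line, that may be why "most of the letter" came back.

```shell
# Scratch directory and file names are hypothetical.
tmp=$(mktemp -d)
printf 'Dear Sir,\nI am writing to complain. The rest of the paragraph sits on this same long line.\nYours truly\n' > "$tmp/letter.doc"

# Copying to a new name leaves the bytes untouched.
cp "$tmp/letter.doc" "$tmp/letter.txt"
if cmp -s "$tmp/letter.doc" "$tmp/letter.txt"; then same=yes; else same=no; fi

# The quoted pattern is one phrase, but grep prints every whole line that contains it.
whole=$(grep 'I am writing' "$tmp/letter.txt")

# -F treats the pattern as a fixed string; -o prints only the matched text.
# (-o is supported by GNU and BSD grep but is not required by POSIX.)
only=$(grep -o -F 'I am writing' "$tmp/letter.txt")

echo "identical after copy: $same"
echo "whole line: $whole"
echo "match only: $only"
rm -rf "$tmp"
```

So the answer to the poster's question is probably "-o", once the file really is text.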

However, is there anything basically wrong with the thought process that happened here? Would it be fair to say "You just aren't getting it - you are thinking of files incorrectly"? True, they ARE not understanding what grep expects, but is that their fault or grep's?

After all, there is precedent for programs behaving differently when they have different names. On many systems, "ls" and "lc" are the same binary hard linked to two names. Invoked as "lc", the binary acts as though it has been given the "-C" flag (on OS X that's the default for terminal output anyway). If a binary can behave differently based on its name, why can't a file do the same?
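The trick is easy to imitate. The following is only a sketch of the idea, not how any real "ls" is implemented: a tiny script (the name "myls" is made up) looks at the name it was invoked under and picks its behavior from that. The second name is a hard link to the same file.

```shell
tmp=$(mktemp -d)

# One script, which inspects $0 to see what name it was called by.
cat > "$tmp/myls" <<'EOF'
#!/bin/sh
case "$(basename "$0")" in
  lc) echo "columns" ;;   # invoked as "lc": act as if -C had been given
  *)  echo "plain" ;;     # any other name: default behavior
esac
EOF

# Hard link: a second name for the very same file.
ln "$tmp/myls" "$tmp/lc"

# Run through sh so the sketch does not depend on execute permission;
# $0 is still set to the path that was used.
as_ls=$(sh "$tmp/myls")
as_lc=$(sh "$tmp/lc")

echo "as myls: $as_ls"
echo "as lc:   $as_lc"
rm -rf "$tmp"
```

Same bytes on disk, two names, two behaviors.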

Well, hold on there: a binary is a program. A plain file is just a static collection of bytes. It's unreasonable to think that it could present different data - it only HAS one data set!

Well, no, that's not necessarily true, especially on OS X. What about resource forks? What about metadata? Would it be entirely unreasonable for a file to expose different parts of its data based on the name used to access it? I'd say no, that's not at all unreasonable.

It's also not unreasonable to think that programs could treat files differently based on names. Why not? Why couldn't "grep" on a .doc file be designed to ferret out paragraphs while reverting to "normal" behavior for text files? Of course it could. It could do OCR on image files or strip out certain colored pixels - I can imagine all sorts of useful things grep could do for many different files and I can certainly see it being designed to act differently on identical data presented under a different name. Of course it doesn't work that way, but it COULD. The fact that I can't think of any program that does treat data differently based on name is not important either: such programs could be written and the paradigm could actually be useful.
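To make that concrete, here is a minimal sketch of a name-sensitive search. "name_grep" is a hypothetical wrapper, not a real utility, and the ".doc" file here is ordinary text wearing a .doc name purely for illustration (a real .doc is a binary format and would need a proper converter). For names ending in .doc it returns whole paragraphs, using awk's paragraph mode; for anything else it falls back to ordinary line-based grep.

```shell
# name_grep PATTERN FILE: hypothetical wrapper that chooses a strategy from the file name.
name_grep() {
  case "$2" in
    *.doc)
      # Paragraph mode: with RS empty, awk treats blank-line-separated blocks as records.
      awk -v pat="$1" 'BEGIN { RS = "" } index($0, pat) > 0' "$2"
      ;;
    *)
      # "Normal" behavior: print matching lines only.
      grep -F -- "$1" "$2"
      ;;
  esac
}

tmp=$(mktemp -d)
printf 'Dear Sir,\nI am writing\nto complain.\n\nYours truly\n' > "$tmp/letter.txt"
cp "$tmp/letter.txt" "$tmp/letter.doc"   # identical data, different name

as_txt=$(name_grep 'I am writing' "$tmp/letter.txt")
as_doc=$(name_grep 'I am writing' "$tmp/letter.doc")

echo "as .txt:"; echo "$as_txt"
echo "as .doc:"; echo "$as_doc"
rm -rf "$tmp"
```

Under the .txt name you get the one matching line; under the .doc name you get the whole three-line paragraph, from exactly the same bytes.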

So, our would-be grep user needs to learn a few things about what Unix utilities actually expect. That's fine. But maybe we can learn a few things too. Maybe a little mind shift on our part might actually turn into something useful.





6 comments




© Anthony Lawrence







Wed Mar 4 17:41:36 2009: 5598   drag

I don't know about NeoOffice, but with OpenOffice.org and such the documents are XML files that are compressed with normal 'zip' compression.

So you could unzip the archive first then grep through it.



Wed Mar 4 19:12:02 2009: 5599   TonyLawrence

Nope.

Not without a bit more work - the "content.xml" is just a couple of big long strings:

$ wc content.xml
2 58680 584113 content.xml









Sat Mar 7 17:43:58 2009: 5627   CorkyAgain

Sure, grep could have been written to do different things for different input file types. But that would violate the Unix philosophy of using simple programs that each do one thing and do it well.

The result would probably be a monstrously huge program. Every time you wanted to grep a simple text file or stream, you'd be loading all of the code needed to work with .doc files, image files or whatever else you can think of searching.

For the task of grepping a simple text file, all that unused code would be nothing but flab.

Perhaps your SuperGrep program could put all that stuff in shared libraries and load it only when needed. But that means an increase in program complexity and exposes you to all the well-known problems maintaining shared libraries. ("DLL hell".)

The Unix way would be to write separate search utilities for each file type, and then launch the appropriate one from a simple script.



Sat Mar 7 18:21:57 2009: 5628   TonyLawrence

Oh, absolutely. I'm not an advocate of Swiss Army knife programs. My point is that the apparent expectation isn't entirely unwarranted.



Sun Mar 8 20:25:59 2009: 5632   ToddDeshane

I don't know how good the current status is, but in 2005 at LinuxWorld, I saw Bill Hilf from the Microsoft open source lab demonstrate Monad, which I guess is now called PowerShell. Monad was a C# tool that had Unix-like functionality: the cat command and others that I can't remember. It was able to cat a Word doc. I don't know if they still have that kind of support in PowerShell or not.
Also see:
(link)



Sun Mar 8 23:35:06 2009: 5635   TonyLawrence

Sure - Microsoft is the king of Swiss Army knives.
