A recent post at Unix.com was from someone having difficulty with "grep". This happened to be on Mac OS X, but it really could have happened almost anywhere, even on Windows. The poster was trying to grep a string from a NeoOffice document and of course not getting great results.
Apparently he'd gotten wind of grep from a brief mention in Pogue's The Missing Manual. Admittedly, Pogue isn't very clear about things; he says something like "its search material can be part of any file. especially plain text files" (emphasis mine). I'm not here to beat up on Pogue for his poor comprehension of Unix utilities, but the "especially text files" might have been what prompted this person to follow up with this:
(From post at "proper syntax of grep command" - Unix.com)
I just copied the NeoOffice file (saved as .doc, a word format) to text edit (a .txt file) and the command worked (grep 'I am writing') but it printed the whole letter, or most of it -- which I'm guessing means that it found all the lines with any of the three words in the string and printed those lines. Is it possible to use 'grep' to find this particular sentence fragment and no other lines which don't contain this entire fragment?
You and I know that COPYING a file to something with a .txt extension changes nothing. It's still a .doc file and always will be. If he or she wanted a text file, they needed to do a "Save As" from NeoOffice and choose a text format to save into. Then and only then is grep likely to return the expected results.
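For what it's worth, grep treats a quoted pattern like 'I am writing' as one contiguous string, not as three separate words. A plausible reason the poster saw "the whole letter" (this is my assumption; the post doesn't confirm it) is that the file had few or no line breaks, so the one matching "line" was most of the document. Here is a rough sketch of that behavior; grep_fixed is a made-up name for illustration, not anything grep provides:

```python
def grep_fixed(pattern, text):
    """Return every line of text that contains pattern as one contiguous string."""
    return [line for line in text.splitlines() if pattern in line]


# A file with real line breaks: only the matching line comes back.
plain = "Dear Sir,\nI am writing to complain.\nI am not amused.\nYours, etc.\n"

# The same words with no line breaks, as when a word processor stores
# a paragraph (or a whole letter) as a single "line".
one_line = plain.replace("\n", " ")

print(grep_fixed("I am writing", plain))
print(grep_fixed("I am writing", one_line))
```

The first call returns just the one sentence; the second returns the entire letter, because as far as a line-oriented tool is concerned the entire letter is one line.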
However, is there anything basically wrong with the thought process that happened here? Would it be fair to say "You just aren't getting it - you are thinking of files incorrectly"? True, they ARE not understanding what grep expects, but is that their fault or grep's?
After all, there is precedent for programs behaving differently when they have different names. On many systems, "ls" and "lc" are the same binary hard linked to two names. Invoked as "lc", the binary acts as though it has been given the "-C" flag (on OS X that's the default for terminal output anyway). If a binary can behave differently based on its name, why can't a file do the same?
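The name trick is simple enough to sketch. This is not how any real "ls" is written; it only illustrates the idea of a program looking at the name it was invoked under, and default_flags is a hypothetical helper:

```python
import os
import sys


def default_flags(argv0):
    """Pick default flags from the name the program was invoked under."""
    name = os.path.basename(argv0)
    # Invoked as "lc", behave as if "-C" (columnar output) had been given.
    return ["-C"] if name == "lc" else []


if __name__ == "__main__":
    # Show the effective argument list for this invocation.
    print(default_flags(sys.argv[0]) + sys.argv[1:])
```

Hard link or symlink the script to the name "lc" and the same code starts with a different set of defaults.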
Well, hold on there: a binary is a program. A plain file is just a static collection of bytes. It's unreasonable to think that it could present different data - it only HAS one data set!
Well, no, that's not necessarily true, especially on OS X. What about resource forks? What about metadata? Would it be entirely unreasonable for a file to expose different parts of its data based on the name used to access it? I'd say no, that's not at all unreasonable.
It's also not unreasonable to think that programs could treat files differently based on names. Why not? Why couldn't "grep" on a .doc file be designed to ferret out paragraphs while reverting to "normal" behavior for text files? Of course it could. It could do OCR on image files or strip out certain colored pixels - I can imagine all sorts of useful things grep could do for many different files and I can certainly see it being designed to act differently on identical data presented under a different name. Of course it doesn't work that way, but it COULD. The fact that I can't think of any program that does treat data differently based on name is not important either: such programs could be written and the paradigm could actually be useful.
So, our would-be grep user needs to learn a few things about what Unix utilities actually expect. That's fine. But maybe we can learn a few things too. Maybe a little mind shift on our part might actually turn into something useful.

Wed Mar 4 17:41:36 2009: Subject: drag
I don't know about NeoOffice, but with OpenOffice.org and such the documents are XML files that are compressed with normal 'zip' compression.
So you could unzip the archive first then grep through it.
Wed Mar 4 19:12:02 2009: Subject: TonyLawrence
Nope.
Not without a bit more work - the "content.xml" is just a couple of very long lines:
$ wc content.xml
2 58680 584113 content.xml
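The "bit more work" might look something like the sketch below, which assumes an OpenDocument file (a zip archive with a content.xml inside) rather than the poster's .doc. It pulls the text out of each paragraph element and searches paragraph by paragraph instead of line by line. grep_odt is a hypothetical name, and headings, tables and the like are ignored here:

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used for text elements in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def grep_odt(pattern, odt_file):
    """Return the text of each paragraph in an ODF document that contains pattern."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    hits = []
    for para in root.iter("{%s}p" % TEXT_NS):
        # Join the text of the paragraph and any nested spans.
        text = "".join(para.itertext())
        if pattern in text:
            hits.append(text)
    return hits
```

Called as grep_odt("I am writing", "letter.odt") (the file name is only an example), it returns the matching paragraphs rather than one 584113-character line.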
Sat Mar 7 17:43:58 2009: Subject: there are other possibilities CorkyAgain
Sure, grep could have been written to do different things for different input file types. But that would violate the Unix philosophy of using simple programs that each do one thing and do it well.
The result would probably be a monstrously huge program. Every time you wanted to grep a simple text file or stream, you'd be loading all of the code needed to work with .doc files, image files or whatever else you can think of searching.
For the task of grepping a simple text file, all that unused code would be nothing but flab.
Perhaps your SuperGrep program could put all that stuff in shared libraries and load it only when needed. But that means an increase in program complexity and exposes you to all the well-known problems maintaining shared libraries. ("DLL hell".)
The Unix way would be to write separate search utilities for each file type, and then launch the appropriate one from a simple script.
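A minimal sketch of that kind of launcher, assuming one small searcher per file type and a table that maps extensions to them. Only a plain-text searcher is filled in; the names search_text, SEARCHERS and search are all invented for this example:

```python
import os


def search_text(pattern, path):
    """Plain-text search: return the lines of path that contain pattern."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if pattern in line]


# Hypothetical table mapping file extensions to search functions.
# Searchers for other types would be registered here.
SEARCHERS = {".txt": search_text}


def search(pattern, path):
    """Choose a searcher by file extension; refuse types we don't know."""
    ext = os.path.splitext(path)[1].lower()
    try:
        searcher = SEARCHERS[ext]
    except KeyError:
        raise ValueError("no searcher registered for %r files" % ext)
    return searcher(pattern, path)
```

Each searcher stays small, and a file type nobody has written a searcher for gets a clear refusal instead of garbage output.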
Sat Mar 7 18:21:57 2009: Subject: TonyLawrence
Oh, absolutely. I'm not an advocate of Swiss Army knife programs. My point is that the apparent expectation isn't entirely unwarranted.
Sun Mar 8 20:25:59 2009: Subject: powershell (formerly monad) ToddDeshane
http://todddeshane.net
I don't know how good the current status is, but in 2005 at Linux World, I saw Bill Hilf from the Microsoft open source lab demonstrate Monad, which I guess is now called PowerShell. Monad was a C# tool that had Unix-like functionality (the cat command and others that I can't remember), and it was able to cat a Word doc. I don't know if they still have that kind of support in PowerShell or not.
Also see:
http://blogs.msdn.com/powershell/archive/2008/03/23/select-string-and-grep.aspx
Sun Mar 8 23:35:06 2009: Subject: TonyLawrence
Sure - Microsoft is the king of Swiss Army knives.