Recently I downloaded a csv file, intending to extract some data to satisfy my curiosity about something. I wrote a little Perl script to slice and dice the data, and that would have been that - except I wanted to know something quickly from the original file, so I did something like "grep whatever 2003.csv".
I got back nothing.
That's odd, I thought. I know that "whatever" is in there. So I fired up vim and did "/whatever" and, sure enough, there it was.
So why couldn't I extract it with grep?
Hmm. Let's do a "more". Oops! After warning me that "2003.csv" may be a binary file ("See it anyway?"), "more" showed me a mess.
Well, duh, that's why I couldn't grep from the file - the darn thing is utf-16!
So, what can you do if faced with this situation? You have a few choices. You could ask vim to rewrite it. That's easy:
:w ++enc=latin1
Vim can do all sorts of file encoding rewriting; see Using another encoding in the VIM docs.
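For the UTF-16 file above, the round trip inside vim might look like this (a sketch - pick the ++enc value that matches your file's byte order, e.g. utf-16le for a little-endian file):

```
:e ++enc=utf-16 2003.csv
:set fileencoding=utf-8
:w
```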
You could use Perl to rewrite the file, though Perl has some funny ideas about what utf8 means, plus some other oddities here and there.
At the Terminal command line, you can use "iconv":
iconv -f utf-16 -t utf-8 2003.csv | grep whatever
That gets old fast, though, so I just converted the file.
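Converting once is simple enough; iconv won't write in place, so go through a temporary file (the names below are just the example from above):

```shell
# Setup for the illustration: a small UTF-16 file standing in for 2003.csv
printf 'date,name\n2003,whatever\n' | iconv -f utf-8 -t utf-16 > 2003.csv

# The one-time conversion, via a temporary file
iconv -f utf-16 -t utf-8 2003.csv > 2003.csv.new && mv 2003.csv.new 2003.csv

# Now grep works directly
grep whatever 2003.csv
```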
Wouldn't it have been nice if we never had 7 or 8 bit encodings?
Had a client on a Red Hat system complain that he was getting an error from cron. A script in cron.weekly complained about "No such file or directory", but the file was there - it made no immediate sense.
The error seemed rather definite:
/etc/cron.weekly/procmail-users: /usr/bin/run-parts: /etc/cron.weekly/procmail-users: No such file or directory
I figured it was going to be a symlink problem or an incorrect shebang line, but no, everything looked fine, and you'd get the same error running it from the command line.
I kept looking and looking at this until I noticed that while editing it in vi, a little "[dos]" appeared next to the file name at the bottom of the screen.
Aha. A "file proc*" confirmed that the script had CRLF line endings. Normally I'd expect to see ^M's in vi, though, and I didn't - that still puzzles me a little. But since the script was just a one-liner, I removed and recreated it manually, and now of course it works.
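Recreating by hand works for a one-liner; for anything bigger, stripping the carriage returns mechanically is easier. A sketch with a throwaway script (not the real procmail-users file; dos2unix does the same job where it's installed):

```shell
# A DOS-format script: every line ends in CRLF
printf '#!/bin/sh\r\necho weekly job\r\n' > myscript

# "file myscript" would report "... with CRLF line terminators" here.
# Strip the carriage returns, then restore the execute bit lost by the copy:
tr -d '\r' < myscript > myscript.unix
mv myscript.unix myscript
chmod +x myscript

./myscript
```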
You can also do:
:set ff=unix
and then write the file; that converts dos or mac line endings to unix.
Of course there's :set ff=dos and :set ff=mac too.
You can be more verbose if you wish:
:set fileformat=unix
Nowadays, you may run into UTF-8 vs. UTF-16 problems too. See Converting File Encodings
Got something to add? Send me email.
More Articles by Anthony Lawrence © 2013-07-31
Wed Aug 10 13:07:40 2005: 952 anonymous
This happened to me also. I'd been beating my head against the wall wondering why cron was complaining. This worked!!!!
Thanks,
Timothy Ste. Marie
Thu Oct 21 06:12:39 2010: 9049 Soumen
What happens if you do this?
cat file_with_non_printable_stuff.dat |grep pattern
Thu Oct 21 09:06:34 2010: 9050 TonyLawrence
No, that won't do it.
You would expect it to, because "cat" will show the file, but grep still comes up empty.
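To see why, a quick demonstration (the file name is made up): "cat" copies the bytes through the pipe unchanged, so grep is still staring at UTF-16; only an actual conversion helps:

```shell
# A UTF-16 file containing the word "whatever"
printf 'whatever\n' | iconv -f utf-8 -t utf-16 > data.dat

# Finds nothing: the pipe still carries UTF-16 bytes (NULs between letters)
cat data.dat | grep whatever || echo "no match"

# Finds the line once the bytes are really converted
iconv -f utf-16 -t utf-8 data.dat | grep whatever
```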
Wed Nov 3 22:25:26 2010: 9096 anonymous
Thanks so much for this- you hit on *exactly* what I was trying to work on, and I really appreciate you sharing this. absolutely made my day and saved me a bunch of time.
Cheers!
Converting File Encodings Copyright © September 2010 Tony Lawrence