Converting File Encodings

I recently downloaded a CSV file with the intention of extracting some data to satisfy my curiosity about something. I wrote a little Perl script to slice and dice the data, and that would have been that, except I wanted to know something quickly from the original file, so I did something like "grep whatever 2003.csv".

I got back nothing.

That's odd, I thought. I know that "whatever" is in there. So I fired up vim and did "/whatever" and, sure enough, there it was.

So why couldn't I extract it with grep?

Hmm. Let's do a "more". Oops! After warning me that "2003.csv" may be a binary file and asking whether to see it anyway, "more" showed me a mess.

[Screenshot: "more" output of the 16-bit file]

Well, duh, that's why I couldn't grep from the file - the darn thing is utf-16!
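
One quick way to confirm a suspicion like this is to look at the first few bytes of the file. Here is a minimal sketch, assuming the file starts with a byte order mark (many UTF-16 files do, but not all). The function name bom_guess is made up for illustration, and it does not try to tell UTF-32 apart from UTF-16:

```shell
# bom_guess FILE: print a guess at the encoding based on the first bytes.
# Hypothetical helper; it only recognizes a byte order mark (BOM) and
# says "unknown" for everything else, including BOM-less UTF-16.
bom_guess() {
    # Dump the first three bytes as hex, without offsets or spaces.
    bytes=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$bytes" in
        fffe*)   echo "utf-16le" ;;
        feff*)   echo "utf-16be" ;;
        efbbbf*) echo "utf-8 (with BOM)" ;;
        *)       echo "unknown" ;;
    esac
}
```

Running "bom_guess 2003.csv" on a file like mine would be expected to report one of the two utf-16 variants.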

So, what can you do if faced with this situation? You have a few choices. You could ask vim to rewrite it. That's easy:


:w ++enc=latin1
 

Vim can do all sorts of file encoding rewriting; see "Using another encoding" in the Vim docs.

You could use Perl to rewrite the file, though Perl has some funny ideas about what utf8 means (its "utf8" encoding is laxer than strict "UTF-8"), plus some other oddities here and there.

At the Terminal command line, you can use "iconv":

 iconv -f utf-16 -t utf-8 2003.csv | grep whatever
 

Though that gets old fast, so I just converted the file.
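
The one-time conversion can be sketched like this, assuming iconv is available and the source really is UTF-16. The function name and the output file name are only illustrative:

```shell
# to_utf8 SRC DEST: write a UTF-8 copy of a UTF-16 file.
# Hypothetical helper; it writes to a new file instead of
# overwriting SRC, so a failed conversion leaves the original untouched.
to_utf8() {
    iconv -f utf-16 -t utf-8 "$1" > "$2"
}
```

Something like "to_utf8 2003.csv 2003-utf8.csv && grep whatever 2003-utf8.csv" should then behave the way the original grep was expected to.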

Wouldn't it have been nice if we never had 7 or 8 bit encodings?

No such file or directory error message

Had a client on a Red Hat system complain that he was getting an error from cron. A script in cron.weekly complained about "No such file or directory", but the file was there - it made no immediate sense.

The error seemed rather definite:

/etc/cron.weekly/procmail-users:
/usr/bin/run-parts: /etc/cron.weekly/procmail-users: No such file or directory
 

I figured it was going to be a symlink problem or an incorrect shebang line, but no, everything looked fine, and you'd get the same error running it from the command line.

I kept looking and looking at this until I noticed that while editing it in vi, a little "[dos]" appeared next to the file name at the bottom of the screen.

Ahah. A "file proc*" confirmed that this had CRLF line endings. But normally I'd expect to see ^M's in vi; I didn't. That puzzles me a little, but since the script was just a one-liner, I removed and recreated it manually and now of course it works.
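
The most likely reason for the misleading message: with CRLF endings, the shebang line ends in a carriage return, so the kernel looks for an interpreter whose name includes that trailing character, and no such file exists. As for the missing ^M's, vim hides them when it detects the dos format, which is what the "[dos]" flag indicates. Here is a minimal sketch of a check that does not depend on the editor. The function name has_crlf is made up:

```shell
# has_crlf FILE: succeed (exit status 0) if the first line ends in CR LF.
# Hypothetical helper; it inspects only the first line, which is where
# a shebang would be.
has_crlf() {
    # Take the last two bytes of the first line and dump them as hex.
    tail2=$(head -n 1 "$1" | tail -c 2 | od -An -tx1 | tr -d ' \n')
    [ "$tail2" = "0d0a" ]
}
```

For example: has_crlf /etc/cron.weekly/procmail-users && echo "CRLF line endings"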

You can also do:

:set ff=unix 
 

and then write the file (:w); this converts DOS or Mac line endings to Unix.

Of course there's :set ff=dos and :set ff=mac too.

You can be more verbose if you wish:

:set fileformat=unix 
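
The same conversion can be done without opening an editor. Here is a minimal sketch, assuming the file uses DOS (CRLF) endings and not old Mac (CR only) ones. The function name is made up, and it writes to a new file:

```shell
# crlf_to_lf SRC DEST: copy SRC to DEST with every carriage return removed.
# Hypothetical helper; it deletes all CR bytes, not just those at line
# ends, which is fine for typical scripts but would mangle a file that
# uses CR alone as its line terminator.
crlf_to_lf() {
    tr -d '\r' < "$1" > "$2"
}
```

DEST is a new file, so it will not carry over the execute bit; a chmod may be needed before cron can run it.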
 

Nowadays, you may run into UTF-8 vs. UTF-16 problems too. See Converting File Encodings.




4 comments




© Anthony Lawrence







Wed Aug 10 13:07:40 2005: 952   anonymous


This happened to me also. Been beating my head against the wall why CRON was complaining. This worked!!!!

Thanks,
Timothy Ste. Marie






Thu Oct 21 06:12:39 2010: 9049   Soumen



What happens if you do this?

cat file_with_non_printable_stuff.dat |grep pattern



Thu Oct 21 09:06:34 2010: 9050   TonyLawrence



No, that won't do it.

You would expect it to because "cat" will show the file, but it does not.



Wed Nov 3 22:25:26 2010: 9096   anonymous



Thanks so much for this- you hit on *exactly* what I was trying to work on, and I really appreciate you sharing this. absolutely made my day and saved me a bunch of time.

Cheers!
