Gentoo Forums
How to create UTF-8 files?
stephandale
Tux's lil' helper


Joined: 28 Jan 2006
Posts: 86

Posted: Thu Jun 14, 2007 9:18 pm    Post subject: How to create UTF-8 files?

Hi all.

I've been looking into creating files with UTF-8 encoding and I'm confused! I need to create UTF-8 files for my Rails application, so that I can use one of the internationalisation (or internationalization or i18n) plugins.

I think I run Gentoo with us-ascii encoding - that's the encoding shown when I query a file with 'file -bi [filename]' - but I'm having trouble both saving files as UTF-8 via vim and converting files to UTF-8 with the recode and convmv commands.

Does Gentoo need to be using utf-8 encoding to allow a file to be converted to utf-8? Ideally, I'd like to leave my system encoding as-is, but save certain files as UTF-8. Is this possible?

Many thanks.
_________________
Stephan Dale
http://mindspill.net/computing/linux-notes/gentoo/
gentoo(a)mindspill.net
Genone
Retired Dev


Joined: 14 Mar 2003
Posts: 9192
Location: beyond the rim

Posted: Thu Jun 14, 2007 10:44 pm

Well, the `file` command will always tell you that a file is ASCII text if it contains only ASCII characters. The encoding only matters when you use characters beyond the plain English alphabet (anything outside the ASCII range), so you can't really convert an ASCII file into a UTF-8 file, simply because the two would be byte-for-byte identical.
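
To make the "identical" part concrete, here is a minimal sketch that dumps the bytes of a pure-ASCII string before and after an ASCII to UTF-8 conversion. It assumes `od`, `tr` and `iconv` are installed; the helper name `hex_of_stdin` is made up for this example.

```shell
# Hex-dump stdin as one continuous lowercase string (hypothetical helper).
hex_of_stdin() {
    od -An -tx1 | tr -d ' \n'
}

# Plain ASCII text, dumped as-is ...
plain=$(printf 'hello\n' | hex_of_stdin)
# ... and the same text after an ASCII -> UTF-8 conversion.
converted=$(printf 'hello\n' | iconv -f ASCII -t UTF-8 | hex_of_stdin)

if [ "$plain" = "$converted" ]; then
    echo "identical bytes: $plain"
fi
```

Both dumps come out as 68656c6c6f0a, so there is nothing for `file` to tell apart.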
stephandale
Tux's lil' helper


Joined: 28 Jan 2006
Posts: 86

Posted: Thu Jun 14, 2007 10:57 pm

Ok, that's good - I'm a step closer to knowing why I always see us-ascii when I'm not expecting it!

How do I test whether a file has been saved as UTF-8? - I've changed my vim config to use UTF-8 and now I need to check whether it's actually creating files as UTF-8. Do you know of any tools I can use to do this?
Genone
Retired Dev


Joined: 14 Mar 2003
Posts: 9192
Location: beyond the rim

Posted: Fri Jun 15, 2007 12:00 am

Well, you'd have to create a file that contains characters above ASCII value 127 (e.g. German umlauts, French accents, or Cyrillic or Chinese letters). I can't really tell you how to do that if you can't enter those directly with your keyboard, though; you'll have to find someone else who can.
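
One way around the keyboard problem, offered only as a sketch: `printf` accepts octal escapes, so the UTF-8 bytes can be written out directly. The word "Grüße" and the use of `mktemp` for a throwaway file are choices made for this example, not anything from the thread.

```shell
# Write "Grüße" plus a newline without typing any non-ASCII character.
# \303\274 is the UTF-8 encoding of ü (U+00FC), \303\237 that of ß (U+00DF).
testfile=$(mktemp)          # throwaway file; any path would do
printf 'Gr\303\274\303\237e\n' > "$testfile"

# 5 characters + newline, but 8 bytes, because ü and ß take two bytes each.
bytes=$(wc -c < "$testfile" | tr -d ' ')
echo "$bytes bytes written to $testfile"

# If file(1) is installed, this should report a utf-8 charset:
# file -bi "$testfile"
```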
jlh
Tux's lil' helper


Joined: 06 May 2007
Posts: 145
Location: Switzerland::Zürich

Posted: Fri Jun 15, 2007 12:43 am

If memory serves, just type ":set fileencoding=utf8" in vim and then save the file. (These days you might even make that a default.) Note that this is completely unrelated to the locale/encoding your shell uses, and also to the encoding that vim uses internally while the file is in RAM. The above setting won't be remembered across vim sessions, but upon loading a file vim will be able to autodetect the format in most cases and will reuse that. But as said above, a US-ASCII file and a utf8 file that doesn't use any non-US-ASCII characters are identical, so vim can't guess and will use a default.

If you really want to verify your file, you can hexdump it. In a hexdump of a utf8 file, all normal (US-ASCII) characters are represented by a single byte, as you would expect. Fancy characters like German umlauts (öäü) are two bytes long instead. Some even fancier ones (most Chinese characters, for instance) are three bytes long, and the fanciest ones take four bytes, which is the maximum in current UTF-8 (the original design allowed up to 6, but RFC 3629 restricted it to 4). Every byte belonging to a multi-byte character has its high bit set. Given that, spotting utf8 in a hexdump shouldn't be too difficult. (Search the web for more info on how utf8 works.)
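
A small sketch of the hexdump check, assuming `od`, `tr` and `wc` are available. The helper names `sample` and `count_high_bytes` are invented for this example, and the test string is made up.

```shell
# Emit "a", "ä" (2 bytes), "€" (3 bytes) and a newline as raw UTF-8 bytes.
sample() {
    printf 'a\303\244\342\202\254\n'
}

# Count the bytes on stdin that have their high bit set (values 128-255).
# LC_ALL=C makes tr work on raw bytes rather than on characters.
count_high_bytes() {
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

sample | od -An -tx1        # bytes: 61 c3 a4 e2 82 ac 0a
sample | count_high_bytes   # prints 5
```

A count of 0 means the file is pure ASCII; anything else means multi-byte sequences (or some 8-bit encoding) are present.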

An alternative way to create utf8 (or any other encoding) files is to use iconv, which will happily convert from and to hundreds of encodings. For example "iconv -f latin1 -t utf8 < input > output". (-f and -t mean 'from' and 'to')
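
As a rough illustration of what that conversion does to the bytes (the input here is a made-up one-character sample, and `od`/`tr` are assumed to be installed):

```shell
# The single byte 0xE4 (octal 344) is "ä" in Latin-1 (ISO-8859-1).
latin1_hex=$(printf '\344\n' | od -An -tx1 | tr -d ' \n')

# After conversion the same character is the two-byte sequence c3 a4.
utf8_hex=$(printf '\344\n' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "latin1: $latin1_hex"   # e40a
echo "utf8:   $utf8_hex"     # c3a40a
```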
stephandale
Tux's lil' helper


Joined: 28 Jan 2006
Posts: 86

Posted: Fri Jun 15, 2007 8:25 am

Thank you both.

I'd already set vim to use utf-8 with the options 'set encoding=utf8' and 'set fileencoding=utf-8', though as you both said, the file I created was identical to a us-ascii one because I hadn't saved any non-ASCII characters in it.

Trying again, I typed some non-ASCII characters (on my keyboard, most character keys produce an accented or special character if you hold down the alt key), and this time the file was reported as utf-8.

Without unicode characters:

Code:
steph@localhost ~ $ file -bi utf8test.txt
text/plain; charset=us-ascii


With unicode characters:

Code:
steph@localhost ~ $ file -bi utf8test.txt
text/plain; charset=utf-8
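
Since `file` only guesses from the content, a stricter check is to let iconv decode the file as UTF-8 and look at its exit status. This is just a sketch: the function name `is_valid_utf8` and the two temporary files are made up for the example.

```shell
# Succeeds if the given file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

good=$(mktemp); bad=$(mktemp)
printf 'abc \303\244\n' > "$good"   # "ä" correctly encoded as c3 a4
printf 'abc \344\n'     > "$bad"    # lone Latin-1 byte e4: not valid UTF-8

is_valid_utf8 "$good" && echo "good: valid UTF-8"
is_valid_utf8 "$bad"  || echo "bad: not valid UTF-8"
rm -f "$good" "$bad"
```

Note that a pure-ASCII file passes this check too, which is consistent with what was said above about ASCII and UTF-8 being identical for such files.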


For completeness, I tried to convert the utf8 file back to ascii, but I can't get it to work.

Code:
steph@localhost ~ $ iconv -f utf8 -t ascii utf8test.txt
abcdefghijklmnopqrstuvwxyz
iconv: illegal input sequence at position 27


Code:
steph@localhost ~ $ iconv -f utf8 -t ascii < utf8test.txt
abcdefghijklmnopqrstuvwxyz
iconv: illegal input sequence at position 27


Code:
steph@localhost ~ $ recode ascii utf8test.txt
recode: utf8test.txt failed: Invalid input in step `ANSI_X3.4-1968..CHAR'


Does anyone know how to convert from utf8 to ascii?

P.S. The test file contains the following:

abcdefghijklmnopqrstuvwxyz
á ãä ç éêëìíîïðñò õö øùú
Knieper
l33t


Joined: 10 Nov 2005
Posts: 846

Posted: Fri Jun 15, 2007 8:43 am

stephandale wrote:
For completeness, I tried to convert the utf8 file back to ascii, but I can't get it to work.

Never convert Unicode to other character sets.

Quote:
Does anyone know how to convert from utf8 to ascii?

If you find a non-ASCII character set that contains all your characters, use that one (e.g. iso8859-15). Otherwise you have to define a mapping from "unknown characters" (e.g. ä) to "known characters" (e.g. ae or a). Simplest solution: omit these characters :) (iconv -cs)
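
The three options could look roughly like this. It is only a sketch: the `sample` helper is made up, and `//TRANSLIT` is swapped in as one ready-made mapping; it is a GNU iconv extension that is assumed to be available and may behave differently elsewhere.

```shell
# "ä" as UTF-8, between two ASCII letters (hypothetical test input).
sample() { printf 'a\303\244b\n'; }

# Option 1: a non-ASCII 8-bit charset that contains the character.
sample | iconv -f UTF-8 -t ISO-8859-15 | od -An -tx1   # bytes: 61 e4 62 0a

# Option 2: transliterate. What "ä" turns into depends on the iconv
# implementation and the locale, so no particular output is promised here.
sample | iconv -f UTF-8 -t ASCII//TRANSLIT || true

# Option 3: drop whatever cannot be represented (-c), silently (-s).
# Some iconv versions still exit non-zero after dropping characters,
# hence the "|| true".
dropped=$(sample | iconv -cs -f UTF-8 -t ASCII || true)
echo "$dropped"    # ab
```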
stephandale
Tux's lil' helper


Joined: 28 Jan 2006
Posts: 86

Posted: Fri Jun 15, 2007 9:13 am

Quote:
Most simple solution: omit these characters :) (iconv -cs)


Yep, that does the job nicely - give me a chisel over a scalpel any day ;-) Thanks.
All times are GMT