stephandale Tux's lil' helper

Joined: 28 Jan 2006 Posts: 75
|
Posted: Thu Jun 14, 2007 4:18 pm Post subject: How to create UTF-8 files? |
|
|
Hi all.
I've been looking into creating files with UTF-8 encoding and I'm confused! I need to create UTF-8 files for my Rails application, so that I can use one of the internationalisation (or internationalization or i18n) plugins.
I think I run Gentoo with us-ascii encoding - that's the encoding shown when I query a file with 'file -bi [filename]' - but I'm having trouble both saving files as UTF-8 in vim and converting files to UTF-8 with the recode and convmv commands.
Does Gentoo need to be using utf-8 encoding to allow a file to be converted to utf-8? Ideally, I'd like to leave my system encoding as-is, but save certain files as UTF-8. Is this possible?
Many thanks. _________________ Stephan Dale
http://mindspill.net/computing/linux-notes/gentoo.html
gentoo(a)mindspill.net |
|
Genone Developer


Joined: 14 Mar 2003 Posts: 7363 Location: in a weird world
|
Posted: Thu Jun 14, 2007 5:44 pm Post subject: |
|
|
| Well, the `file` command will always report a file as ASCII text if it contains only plain ASCII characters. The encoding only matters once you use characters beyond the basic English alphabet, so you can't really convert an ASCII file into a UTF-8 file: the two would be byte-for-byte identical. |
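To make the point above concrete, here is a minimal Python sketch (the sample strings are made up for illustration). It encodes the same ASCII-only text both ways and compares the bytes, then shows where the two encodings part ways once an accented letter appears.

```python
# Hypothetical sample text; any string limited to code points 0-127 behaves the same.
text = "abcdefghijklmnopqrstuvwxyz\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure-ASCII text the two encodings yield the same byte sequence.
identical = ascii_bytes == utf8_bytes
print(identical)

# With one accented letter (e-acute) ASCII can no longer represent the text.
accented = "caf\u00e9"
try:
    accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_ok)
print(accented.encode("utf-8"))
```

This is why a tool that inspects only the bytes has nothing to distinguish an ASCII file from a UTF-8 file until a non-ASCII character is present.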
|
stephandale Tux's lil' helper

Joined: 28 Jan 2006 Posts: 75
|
Posted: Thu Jun 14, 2007 5:57 pm Post subject: |
|
|
Ok, that's good - I'm a step closer to knowing why I always see us-ascii when I'm not expecting it!
How do I test whether a file has been saved as UTF-8? I've changed my vim config to use UTF-8 and now I need to check whether it's actually creating files as UTF-8. Do you know of any tools I can use to do this? |
|
Genone Developer


Joined: 14 Mar 2003 Posts: 7363 Location: in a weird world
|
Posted: Thu Jun 14, 2007 7:00 pm Post subject: |
|
|
| Well, you'd have to create a file that contains characters above ASCII value 127 (e.g. German umlauts, French accents, or Cyrillic or Chinese letters). I can't really tell you how to do that if you can't enter them directly with your keyboard, though; you may have to find someone who can. |
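One way around the keyboard problem, sketched here in Python under the assumption that a scripting language is acceptable: write the characters as Unicode escapes and let the interpreter produce the UTF-8 bytes. The sample characters and the temporary directory are arbitrary choices for illustration; the file name utf8test.txt is just a placeholder.

```python
import os
import tempfile

# Hypothetical sample: a-umlaut, e-acute, Cyrillic "zhe" and one Chinese character,
# written as escapes so that no special keyboard layout is needed.
sample = "\u00e4 \u00e9 \u0436 \u4e2d\n"

# Write the file explicitly as UTF-8, independent of the system locale.
path = os.path.join(tempfile.mkdtemp(), "utf8test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
# Decoding succeeds only if the bytes form valid UTF-8.
print(raw.decode("utf-8") == sample)
```

Running `file -bi` on a file produced this way should then have non-ASCII bytes to work with.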
|
jlh Tux's lil' helper


Joined: 06 May 2007 Posts: 140 Location: Switzerland::Zürich
|
Posted: Thu Jun 14, 2007 7:43 pm Post subject: |
|
|
If memory serves, just type ":set fileencoding=utf8" in vim and then save the file. (These days you might even make that a default.) Note that this is completely unrelated to the locale/encoding your shell uses, and also to the encoding that vim uses internally while the file is in RAM. The setting won't be remembered across vim sessions, but upon loading a file vim can autodetect the encoding in most cases and will reuse it. But as said above, a US-ASCII file and a utf8 file that doesn't use any non-US-ASCII characters are identical, so vim can't guess and will use a default.
If you really want to verify your file, you can hexdump it. In a hexdump of a utf8 file, all normal (US-ASCII) characters are represented as single bytes, as you would expect. Fancy characters like German umlauts (öäü) are two bytes long instead. Some even fancier characters (Chinese ones, for example) take three bytes, and the rarest ones take four. (The original utf8 design allowed sequences of up to 6 bytes, but RFC 3629 limits it to 4.) Every byte belonging to a multi-byte character has its high bit set. Given that, spotting utf8 in a hexdump shouldn't be too difficult. (Use google for more info on how utf8 works.)
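As a rough illustration of what to look for in the hexdump, this Python sketch prints the UTF-8 bytes of a one-, two-, three- and four-byte character and checks the high bit on each byte. The helper name `describe` and the sample characters are made up for this example.

```python
def describe(ch):
    """Return the UTF-8 bytes of one character as hex, plus whether
    every one of those bytes has its high bit (0x80) set."""
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    all_high = all(b & 0x80 for b in encoded)
    return hex_bytes, all_high

# Plain letter, o-umlaut, a Chinese character, and an emoji outside the BMP.
for ch in ["a", "\u00f6", "\u4e2d", "\U0001f600"]:
    hex_bytes, all_high = describe(ch)
    print(f"U+{ord(ch):04X}: {hex_bytes} (high bit on every byte: {all_high})")
```

Only the plain letter comes out as a single byte below 0x80; every byte of the other three is 0x80 or above.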
An alternative way to create utf8 (or any other encoding) files is to use iconv, which will happily convert from and to hundreds of encodings. For example "iconv -f latin1 -t utf8 < input > output". (-f and -t mean 'from' and 'to') |
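For comparison, the same latin1-to-utf8 conversion can be sketched in Python as a decode followed by an encode. This is only an illustration of the idea, not how iconv is implemented; the function name `reencode` and the sample bytes are assumptions.

```python
def reencode(data, source, target):
    """Decode bytes from the source encoding and re-encode them in the
    target encoding, roughly what 'iconv -f source -t target' does."""
    return data.decode(source).encode(target)

# Hypothetical input: the three umlauts o, a, u as latin1, one byte each.
latin1_bytes = "\u00f6\u00e4\u00fc".encode("latin-1")
utf8_bytes = reencode(latin1_bytes, "latin-1", "utf-8")

print(latin1_bytes)  # three single bytes
print(utf8_bytes)    # three two-byte sequences
```

Note that the source encoding has to be stated correctly: the bytes themselves don't say which single-byte character set they came from.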
|
stephandale Tux's lil' helper

Joined: 28 Jan 2006 Posts: 75
|
Posted: Fri Jun 15, 2007 3:25 am Post subject: |
|
|
Thank you both.
I'd already set vim to use utf-8 with the options 'set encoding=utf8' and 'set fileencoding=utf-8', though as you both said, the file I created was identical to a us-ascii one because I'd not saved any Unicode characters.
I tried again, this time typing some Unicode characters (most character keys on the keyboard will create a Unicode equivalent if you hold down the alt key), and the file was reported as utf-8.
Without unicode characters:
| Code: | steph@localhost ~ $ file -bi utf8test.txt
text/plain; charset=us-ascii |
With unicode characters:
| Code: | steph@localhost ~ $ file -bi utf8test.txt
text/plain; charset=utf-8 |
For completeness, I tried to convert the utf8 file back to ascii, but I can't get it to work.
| Code: | steph@localhost ~ $ iconv -f utf8 -t ascii utf8test.txt
abcdefghijklmnopqrstuvwxyz
iconv: illegal input sequence at position 27 |
| Code: | steph@localhost ~ $ iconv -f utf8 -t ascii < utf8test.txt
abcdefghijklmnopqrstuvwxyz
iconv: illegal input sequence at position 27 |
| Code: | steph@localhost ~ $ recode ascii utf8test.txt
recode: utf8test.txt failed: Invalid input in step `ANSI_X3.4-1968..CHAR' |
Does anyone know how to convert from utf8 to ascii?
P.S. The test file contains the following:
abcdefghijklmnopqrstuvwxyz
á ãä ç éêëìíîïðñò õö øùú |
|
Knieper Guru

Joined: 10 Nov 2005 Posts: 568
|
Posted: Fri Jun 15, 2007 3:43 am Post subject: |
|
|
| stephandale wrote: | | For completeness, I tried to convert the utf8 file back to ascii, but I can't get it to work. |
Never convert Unicode to other character sets.
| Quote: | | Does anyone know how to convert from utf8 to ascii? |
If you find a non-ASCII character set that contains all your characters, use that one (e.g. iso8859-15). Otherwise you have to define a mapping from "unknown characters" (e.g. ä) to "known characters" (e.g. ae or a). Most simple solution: omit these characters (iconv -cs, where -c drops characters that can't be converted and -s suppresses the warnings) |
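The three options above can be sketched in Python as follows. The mapping table and the helper name `transliterate` are hypothetical and would need extending for real text; the accent-stripping fallback via NFKD normalisation is an extra technique not mentioned in the thread.

```python
import unicodedata

text = "\u00e1 \u00e4 \u00e7 \u00f1"  # a-acute, a-umlaut, c-cedilla, n-tilde

# Option 1: a single-byte character set that contains all the characters.
latin9 = text.encode("iso8859-15")

# Option 2: an explicit mapping (hypothetical table, extend as needed).
MAPPING = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"}

def transliterate(s):
    """Apply the explicit mapping first, then strip remaining accents by
    decomposing each character (NFKD) and dropping the non-ASCII parts."""
    replaced = "".join(MAPPING.get(ch, ch) for ch in s)
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Option 3: simply omit everything that is not ASCII (comparable to iconv -c).
dropped = text.encode("ascii", "ignore").decode("ascii")

print(latin9)
print(transliterate(text))
print(repr(dropped))
```

Options 2 and 3 both lose information, which is the reason for the advice against converting away from Unicode in the first place.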
|
stephandale Tux's lil' helper

Joined: 28 Jan 2006 Posts: 75
|
Posted: Fri Jun 15, 2007 4:13 am Post subject: |
|
|
| Quote: | Most simple solution: omit these characters (iconv -cs) |
Yep, that does the job nicely - give me a chisel over a scalpel any day. Thanks. |
|
|