gna n00b
Joined: 19 Mar 2003 Posts: 38 Location: Beijing
|
Posted: Thu Apr 29, 2004 12:15 am Post subject: HOWTO: Using UTF-8 on Gentoo (edited) |
|
|
HOWTO: Using UTF-8 on Gentoo
Changes
2004-05-10 Added Samba and Converting Filenames sections, edited Setting the Locale, recommended xorg-x11, added link to forums discussion and openi18n testsuites, small cleanup.
Introduction
Every now and again there is a flurry of interest in the forums or the mailing lists from people that want to use UTF-8 on Gentoo. Despite this interest and some information in the Gentoo Linux Localization Guide there are still quite a few questions and not much progress in moving Gentoo to a fully UTF-8 based distribution. My hope is that this HOWTO might increase the understanding of UTF-8 and related issues in Gentooland. There is a great deal that I don't know about this topic and I am hoping that this HOWTO might generate corrections and additional hints that will also improve my understanding of Linux UTF-8 issues.
Locales
The way to use UTF-8 is through locales. To do this you need glibc 2.2 or later and you also need to compile glibc with the nls USE flag set. It's set by default so you probably already have done this. Locales are the way that you specify that various aspects of your system should use local conventions. In this HOWTO we are only looking at the encoding aspect of locales.
If no locale is set then the system uses the default locale which is the C or POSIX locale.
Environment Variables
Locales are set up through environment variables. There are quite a range of locale related environment variables and they interact with each other somewhat. Most of them begin with LC_ and so are called the LC_* environment variables. The main ones to be concerned about are LANG, LC_CTYPE and LC_ALL. For a brief explanation of all of them see here and for all the gory details see here.
LANG is the variable normally used to set the locale. The LC_* variables (except LC_ALL) are used to override individual parts of that locale while leaving the rest alone. You might want to do this if you were, say, a German living in Japan who wanted a basically German locale but with dates in the Japanese format. You would do this by setting: Code: | export LANG=de_DE.UTF-8@euro
export LC_TIME=ja_JP.UTF-8 | This technique is often used with the LC_CTYPE variable by people who want to be able to input text in another language, especially Chinese, Japanese or Korean.
The LC_ALL variable overrides the values of LANG and the other LC_* variables. So if you set it there is no point in setting the others. Don't set LC_ALL for all users as then if a user wants to mix different locales together they also need to unset LC_ALL. Using LANG will cause less confusion.
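The precedence between these variables can be written down as a small shell function. This is only a sketch of the lookup order (LC_ALL first, then the category's own variable, then LANG, then the default C locale); the function name resolve_locale and its argument order are my own invention and not part of glibc.

```shell
# Print the value that applies to one locale category.
# $1 = value of LC_ALL, $2 = value of the category variable (e.g. LC_TIME),
# $3 = value of LANG. Empty strings count as "not set".
resolve_locale() {
    lc_all=$1
    lc_category=$2
    lang=$3
    if [ -n "$lc_all" ]; then
        printf '%s\n' "$lc_all"
    elif [ -n "$lc_category" ]; then
        printf '%s\n' "$lc_category"
    elif [ -n "$lang" ]; then
        printf '%s\n' "$lang"
    else
        printf 'C\n'
    fi
}

resolve_locale '' 'ja_JP.UTF-8' 'de_DE.UTF-8@euro'   # category variable wins over LANG
resolve_locale '' ''            'de_DE.UTF-8@euro'   # falls back to LANG
resolve_locale 'C' 'ja_JP.UTF-8' 'de_DE.UTF-8@euro'  # LC_ALL overrides everything
```

To ask about your own session you could call it as resolve_locale "${LC_ALL:-}" "${LC_TIME:-}" "${LANG:-}".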
The syntax for the locale environment variables is Code: | [language[_territory][.codeset][@modifier]] | The brackets mean that the enclosed part is optional. However, leaving off parts can cause problems, so it is better to always include every part except the modifier, which is often not used.
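To make the syntax concrete, here is a small sketch that splits a locale name into its four parts using nothing but POSIX parameter expansion. The function name parse_locale and the output format are made up for this illustration.

```shell
# Split language[_territory][.codeset][@modifier] into its parts.
# Parts that are absent are printed as empty values.
parse_locale() {
    name=$1
    modifier=''
    codeset=''
    territory=''
    # Peel the optional parts off from the right-hand side.
    case $name in *@*) modifier=${name#*@};  name=${name%%@*} ;; esac
    case $name in *.*) codeset=${name#*.};   name=${name%%.*} ;; esac
    case $name in *_*) territory=${name#*_}; name=${name%%_*} ;; esac
    printf 'language=%s territory=%s codeset=%s modifier=%s\n' \
        "$name" "$territory" "$codeset" "$modifier"
}

parse_locale 'de_DE.UTF-8@euro'
parse_locale 'ja_JP'
```

The second call shows the risky case: with no codeset in the name, the encoding is left to whatever default the system picks.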
When using more than one locale environment variable you must use the same encoding for every one. The shortened forms of locales have a default codeset and it usually isn't UTF-8 so this is one reason why you should always include the codeset part of the locale.
Commands
Two commands that are very helpful for setting up a UTF-8 locale are locale and localedef. These commands are part of the glibc package and there are no man pages. Some documentation can be found in glibc-2.3.2.tar.bz2/glibc-2.3.2/localedata/README assuming you are using version 2.3.2 of glibc.
You can list all the locales that glibc knows about using locale -a. You can list your current locale settings using just locale. The values of locale environment variables that have been explicitly set, e.g. in an export statement (if using bash), are listed without double quotes. Those whose value has been inherited from other locale environment variables have their values in double quotes. Thus if you set only LANG=en_AU.UTF-8 and LC_CTYPE=ja_JP.UTF-8 then you will get:
Code: | gna $ locale
LANG=en_AU.UTF-8
LC_CTYPE=ja_JP.UTF-8
LC_NUMERIC="en_AU.UTF-8"
LC_TIME="en_AU.UTF-8"
LC_COLLATE="en_AU.UTF-8"
LC_MONETARY="en_AU.UTF-8"
LC_MESSAGES="en_AU.UTF-8"
LC_PAPER="en_AU.UTF-8"
LC_NAME="en_AU.UTF-8"
LC_ADDRESS="en_AU.UTF-8"
LC_TELEPHONE="en_AU.UTF-8"
LC_MEASUREMENT="en_AU.UTF-8"
LC_IDENTIFICATION="en_AU.UTF-8"
LC_ALL= |
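The quoting convention in that output can be read mechanically. The following sketch labels each variable as explicit, inherited or unset; the function name classify_locale_output is hypothetical, and the sample input is a shortened copy of the listing above so that the block does not depend on the locale command being installed.

```shell
# Read `locale` output on stdin and label each variable:
#   unquoted value      -> set explicitly
#   double-quoted value -> inherited from another variable
#   empty value         -> not set at all
classify_locale_output() {
    while IFS='=' read -r var value; do
        case $value in
            '')    printf '%s unset\n' "$var" ;;
            \"*\") printf '%s inherited\n' "$var" ;;
            *)     printf '%s explicit\n' "$var" ;;
        esac
    done
}

classify_locale_output <<'EOF'
LANG=en_AU.UTF-8
LC_CTYPE=ja_JP.UTF-8
LC_NUMERIC="en_AU.UTF-8"
LC_ALL=
EOF
```

On a real system you would pipe the command straight in: locale | classify_locale_output.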
If the locale you would like to use doesn't already exist then you will need to make it. UTF-8 locales are not made by default so you will probably need to do this. Say you want an Australian English UTF-8 locale; then you would run the following command as root: Code: | localedef -i en_AU -f UTF-8 en_AU.UTF-8 | Documentation for localedef can be found in the Single Unix Specification here. The locales are stored in /usr/lib/locale/locale-archive.
At this point you might be wondering why locale -a lists the encoding part of the locale in lower case without any hyphens but these instructions always use UTF-8. The reason is that while glibc understands both forms of the name, many other programs don't. The most common example is X. So it is best to always use UTF-8 in preference to utf8.
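Because locale -a prints the codeset in its lower-case, hyphen-free form, a script that wants to check for en_AU.UTF-8 has to normalise the name before comparing. The sketch below does that and then prints (but does not run) the localedef command when the locale is missing. The function names are my own, and the normalisation rule used here (lower case, letters and digits only) is a simplification of what glibc really does.

```shell
# Rewrite the codeset part of a locale name roughly the way `locale -a`
# prints it: lower case, with everything except letters and digits removed.
normalize_locale_name() {
    name=$1
    modifier=''
    case $name in *@*) modifier=@${name#*@}; name=${name%%@*} ;; esac
    case $name in
        *.*)
            codeset=$(printf '%s' "${name#*.}" |
                tr -cd '[:alnum:]' | tr '[:upper:]' '[:lower:]')
            name=${name%%.*}.$codeset
            ;;
    esac
    printf '%s%s\n' "$name" "$modifier"
}

# Succeed if glibc already knows the locale. Fails quietly when the
# locale command itself is not installed.
locale_exists() {
    command -v locale >/dev/null 2>&1 || return 1
    locale -a 2>/dev/null | grep -Fqx -- "$(normalize_locale_name "$1")"
}

wanted=en_AU.UTF-8
if locale_exists "$wanted"; then
    echo "$wanted is already available"
else
    # Only print the command: localedef has to be run as root.
    echo "localedef -i ${wanted%%.*} -f UTF-8 $wanted"
fi
```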
Setting the Locale
Where to set the environment variables is the next problem. Ideally every user on the whole system should use the same encoding i.e. UTF-8. However using a locale other than the "C" or "POSIX" locale for root is asking for trouble on Gentoo. The reason is that it can break emerges as the semantics of scripts etc can change. See the following bugs 8680, 38418 and 9988.
The Gentoo way to set environment variables is to set them in /etc/env.d. The Desktop Configuration Guide suggests setting the locale there, but this doesn't support setting a different locale for root. Some suggest changing the locale in /etc/profile. It is possible to set the locale differently in /etc/profile for root and other users, as is done with the PATH environment variable. Otherwise you can set the variables in $HOME/.bashrc, or in $HOME/.xinitrc if you only want the locale set under X.
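As a rough illustration of the /etc/profile approach, the fragment below keeps root on the C locale and gives everyone else a UTF-8 locale. It is a sketch only: the function name choose_lang is hypothetical and en_AU.UTF-8 is just the example locale used elsewhere in this HOWTO, so substitute your own.

```shell
# Pick a locale from the numeric user id: root (uid 0) keeps the C locale
# so that emerge and system scripts behave predictably, everyone else
# gets the UTF-8 locale passed as the second argument.
choose_lang() {
    if [ "$1" -eq 0 ]; then
        echo C
    else
        echo "$2"
    fi
}

LANG=$(choose_lang "$(id -u)" en_AU.UTF-8)
export LANG
```

Using LANG rather than LC_ALL here leaves users free to override individual LC_* categories later.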
X
X has its own locale information. This is kept in the /usr/X11R6/lib/X11/locale directory. The localedef command only makes glibc locales; it doesn't make X locales. Fortunately XFree86 has quite a few UTF-8 locales already set up, but unfortunately they don't always work as well as they should. If you get a Code: | Gdk-WARNING **: locale not supported by Xlib | error when running GTK/Gnome apps and a Code: | Qt: Locales not supported on X server | error when running Qt/KDE apps then you should dive into the /usr/X11R6/lib/X11/locale directory. Check that your desired locale is listed in locale.dir. The mapping between shortened names and full names for locales is in locale.alias. If those files are ok then something more fundamental is broken and I suggest you check out the XFree86 bugzilla or the Xorg equivalent, depending on which one you are using.
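Checking locale.dir by eye gets tedious, so here is a sketch of a scripted check. It assumes that each line of locale.dir ends with the full locale name as its last whitespace-separated field (for example "en_US.UTF-8/XLC_LOCALE en_AU.UTF-8"); verify that against your own file before relying on it. The function name x_locale_listed is my own.

```shell
# Succeed if the locale name in $1 is the last field of some line
# in the locale.dir file given as $2.
x_locale_listed() {
    awk -v want="$1" '$NF == want { found = 1 } END { exit !found }' "$2"
}

locale_dir=/usr/X11R6/lib/X11/locale/locale.dir
if [ -r "$locale_dir" ]; then
    if x_locale_listed en_AU.UTF-8 "$locale_dir"; then
        echo "en_AU.UTF-8 is listed in $locale_dir"
    else
        echo "en_AU.UTF-8 is missing from $locale_dir"
    fi
fi
```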
If you are having difficulties getting your UTF-8 locale to operate correctly using XFree86 then I recommend switching to xorg-x11. Its support for UTF-8 locales seems to be much better. Gentoo is adopting xorg-x11 in preference to XFree86 but as of mid May 2004 xorg is not yet marked as stable. There are useful instructions on how to upgrade in the Documentation, Tips and Tricks forum.
Dual Booting Issues
If you are dual booting with Windows then you should read the kernel documentation in /usr/src/linux/Documentation/filesystems/ for your filesystems and kernel version to determine the correct options to put in /etc/fstab so that the file names are properly converted into UTF-8. For NTFS file systems you should use the nls=utf8 option. The utf8=<bool> option has been deprecated there for quite a while but some man pages haven't caught up with this yet. For vfat filesystems, on the other hand, you should use the utf8=yes option.
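To keep the two cases apart, the sketch below maps a filesystem type to the option recommended above and prints a matching /etc/fstab line. The device names and mount points are placeholders, the function names are hypothetical, and in a real fstab you would normally combine the option with others, separated by commas.

```shell
# Print the UTF-8 related mount option for a Windows filesystem type.
utf8_mount_option() {
    case $1 in
        ntfs) echo 'nls=utf8' ;;
        vfat) echo 'utf8=yes' ;;
        *)    echo "no UTF-8 option known for $1" >&2; return 1 ;;
    esac
}

# Print one fstab line: device, mount point, type, options, dump, pass.
# $1 = device, $2 = filesystem type, $3 = mount point.
fstab_line() {
    opt=$(utf8_mount_option "$2") || return 1
    printf '%s\t%s\t%s\t%s\t0 0\n' "$1" "$3" "$2" "$opt"
}

fstab_line /dev/hda1 ntfs /mnt/windows
fstab_line /dev/hda5 vfat /mnt/shared
```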
Samba
Samba 2.2 doesn't support Unicode, so if you want filenames that use non-ASCII characters to look the same on Windows and on Gentoo with a UTF-8 locale then you need to upgrade to Samba 3. If your Windows clients are Unicode capable (Windows 2000 and XP, probably NT also but the Samba documentation isn't clear about it) then the Samba 3 configuration will convert the filenames to UTF-8 by default. If you have Windows 95, 98, Me or older clients then you need to set the "dos charset" option in the /etc/samba/smb.conf file to the appropriate codepage. See Chapter 26 of the Samba-3 HOWTO for more details. If you are upgrading from Samba 2 to Samba 3 then you will need to convert the filenames in your existing shares to UTF-8. Your filenames will be using the codepage of your Windows clients. If you are using English Windows 2000 or XP then your default codepage is probably 1252.
Converting Filenames
If you are sure all your filenames use only ASCII characters (codepoints < 128) then you don't need to bother as they will already be valid UTF-8. The simplest way to convert the encodings of filenames is to use the convmv utility written by Bjoern Jacke. This utility is in portage and will convert the encodings of a whole directory tree. So if your samba share is in the /samba directory you can do the following:
Code: | $ ACCEPT_KEYWORDS='~x86' emerge convmv
$ convmv -r -f cp1252 -t utf8 /samba |
This will run a test and print what it is going to do. To actually do the conversion you need to add the --notest option. There is a man page which is quite helpful.
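Before running convmv it can be useful to see which names contain non-ASCII bytes at all, since pure ASCII names need no conversion. The following is a minimal sketch using only standard tools; the function names are mine and have nothing to do with convmv itself, and the loop does not cope with file names that contain newlines.

```shell
# Succeed if the argument contains any byte outside the ASCII range.
has_non_ascii() {
    [ -n "$(printf '%s' "$1" | LC_ALL=C tr -d '\000-\177')" ]
}

# Print every path under the directory $1 whose final component is not
# pure ASCII. These are the names a conversion would have to touch.
list_non_ascii_names() {
    find "$1" -print | while IFS= read -r path; do
        if has_non_ascii "${path##*/}"; then
            printf '%s\n' "$path"
        fi
    done
}

if has_non_ascii 'readme.txt'; then
    echo 'needs a closer look'
else
    echo 'ASCII only, already valid UTF-8'
fi
```

For the share used in the example above you would run list_non_ascii_names /samba and compare the result with what the convmv test run reports.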
Resources
There are a number of useful resources about UTF-8 on linux. This pdf file gives a very good introduction.
Markus Kuhn's UTF-8 for Linux FAQ is well known and helpful but is more oriented towards developers than users.
Mike Fabian's CJK Support for SuSE is wide ranging, frequently updated and much of what is discussed is generally applicable to UTF-8 on any linux distribution.
The xfree86 i18n mailing list is very helpful as is the linux-utf8 mailing list.
There is a lot of detailed information in the Single Unix Specification.
One should not forget OpenI18N.org that was formerly LinuxI18n. They are setting standards for Opensource internationalization. They are also developing i18n testsuites for Linux and Unix.
Freedesktop.org have a UTF-8 promotion section. |
Biggles Tux's lil' helper
Joined: 06 Nov 2003 Posts: 123 Location: New Zealand
|
Posted: Tue May 11, 2004 2:02 am Post subject: |
|
|
A very well written howto.
I'm having trouble creating the locale for unicode though. I create it as root, since when I tried to create it as a normal user it complained about not having access. I don't think it worked quite right though. When I use the locale as root, im-ja-xim-server works fine, but when I use it as my normal user I get this message:
Code: | im-ja-xim-server
*** WARNING: Your locale (LC_CTYPE=en_NZ.UTF-8;LC_NUMERIC=C;LC_TIME=en_NZ.UTF-8;LC_COLLATE=en_NZ.UTF-8;
LC_MONETARY=en_NZ.UTF-8;LC_MESSAGES=en_NZ.UTF-8;LC_PAPER=en_NZ.UTF-8;
LC_NAME=en_NZ.UTF-8;LC_ADDRESS=en_NZ.UTF-8;LC_TELEPHONE=en_NZ.UTF-8;
LC_MEASUREMENT=en_NZ.UTF-8;LC_IDENTIFICATION=en_NZ.UTF-8) is non-UTF8, nor Japanese! |
Any ideas? It makes me wonder if the locale got created properly when I did "localedef -i en_NZ -f UTF-8 en_NZ.UTF-8"
Also, are there any unicode fonts that need to be installed to get proper support for non-ansi characters? |
gna n00b
Joined: 19 Mar 2003 Posts: 38 Location: Beijing
|
Posted: Tue May 11, 2004 11:28 am Post subject: |
|
|
Quote: | A very well written howto. |
Thanks
Quote: | I'm having trouble creating the locale for unicode though. I create it as root, since when I tried to create it as a normal user it complained about not having access. |
Sorry I forgot to mention that you must run localedef as root.
If it works as root but not as a normal user then that sounds like a permissions problem to me. Check the permissions in /usr/lib/locale/locale-archive and /usr/X11R6/lib/X11/locale and subdirectories.
Also check the xserver log file for error messages /var/log/XFree86.0.log or /var/log/Xorg.0.log
It's possible that you might need to create the ja_JP.UTF-8 locale.
If none of this works setting LC_CTYPE=ja_JP.UTF-8 might help.
Silly question I know, but you do have XMODIFIERS set correctly, don't you? XMODIFIERS and the locale environment variables need to be set before X starts, not in an xterm window: in $HOME/.xinitrc is ok, as are the other places mentioned in the HOWTO.
You will need Japanese fonts. Note that modern fonts can work with multiple encodings. As you are interested in Japanese I particularly recommend Mike Fabian's CJK Support for SuSE document (link in the resources section of the HOWTO). It is more oriented towards Japanese than Chinese and Korean and it has quite a bit of information about font issues. |
Biggles Tux's lil' helper
Joined: 06 Nov 2003 Posts: 123 Location: New Zealand
|
Posted: Tue May 11, 2004 10:34 pm Post subject: |
|
|
All the permissions were set to allow everyone to read all those files, and there was nothing in the log. I tried creating the ja_JP.UTF-8 locale and setting LC_CTYPE, and that seemed to fix the problem. Now, instead of complaining about locales, im-ja-xim-server freezes X. I recall hearing that it needs to be started before X or something, which is probably what's wrong (although putting it in my .xinitrc didn't help), but that's a problem for a different topic. |
ecatmur Advocate
Joined: 20 Oct 2003 Posts: 3595 Location: Edinburgh
|
Posted: Wed May 12, 2004 12:20 am Post subject: |
|
|
Can I just say, thanks for this. I managed to work it out on my own a few months ago, but this would have saved me loads of time - and if more people posted on these forums in UTF-8, then we'd be able to read pages in mixed scripts without seeing (?) characters all over the place...
One thing you might add: when serving UTF-8 encoded pages from apache, you need to put "AddDefaultEncoding UTF-8" in the relevant config files (/etc/apache2/commonhttpd.conf etc.) |
gna n00b
Joined: 19 Mar 2003 Posts: 38 Location: Beijing
|
Posted: Thu May 13, 2004 3:09 am Post subject: |
|
|
Quote: | One thing you might add: when serving UTF-8 encoded pages from apache, you need to put "AddDefaultEncoding UTF-8" in the relevant config files (/etc/apache2/commonhttpd.conf etc.) |
I think you must mean "AddDefaultCharset utf-8" which is apparently in Apache 1.3.12 and later. I found one person who claimed this directive was bad because it overrode the charset in the <meta> element.
Do you use php? I came across this which has some information about php settings for UTF-8 |
AliceDiee n00b
Joined: 22 Jan 2004 Posts: 74 Location: Hattingen, Germany
|
Posted: Thu May 13, 2004 9:31 am Post subject: |
|
|
Thank you for this HowTo!!
Everything is working just fine except for those applications compiled against gtk1.2.x (sylpheed-claws and mplayer in my case).
German umlauts aren't displayed correctly (only in the menus, the rest is ok) until I start them with
Code: | LC_ALL="de_DE@euro" programname |
so I don't think it's a problem with the font being used, is it? |
gna n00b
Joined: 19 Mar 2003 Posts: 38 Location: Beijing
|
Posted: Thu May 13, 2004 1:31 pm Post subject: |
|
|
When you do this Quote: | Code: | LC_ALL="de_DE@euro" programname |
| you are changing the encoding to the default German encoding. That is iso8859-15.
According to the sylpheed-claws FAQ, they have specified a default font with an iso8859-1 encoding, so it isn't surprising that it doesn't work when you change to a UTF-8 locale. I think iso8859-1 is the same as iso8859-15 apart from the euro symbol and a handful of other characters, which is why the umlauts start working again when you switch to an iso8859-15 encoding.
I haven't figured out fonts very well yet but I think there are still quite a few older style fonts in X that are tied to only one encoding. It's only the newer formats e.g. truetype that can handle more than one encoding. |
AliceDiee n00b
Joined: 22 Jan 2004 Posts: 74 Location: Hattingen, Germany
|
Posted: Fri May 14, 2004 10:50 am Post subject: |
|
|
Yeah, I already noticed that, but those font definitions don't affect the menus! They only change the appearance of the message tree, the overview and the messages themselves.
Anyway, it's running now. I had to create a .gtkrc in my homedirectory with the following content
Code: | style "gtk-default" {
fontset = "-*-helvetica-medium-r-normal-*-12-*-*-*-*-*-iso10646-1"
}
class "GtkWidget" style "gtk-default" |
Choose any other font you like supporting iso10646-1 (maybe with xfontsel) |
revertex l33t
Joined: 23 Apr 2003 Posts: 806
gna n00b
Joined: 19 Mar 2003 Posts: 38 Location: Beijing
|
Posted: Sat May 15, 2004 5:34 am Post subject: |
|
|
Quote: | I had to create a .gtkrc in my homedirectory with the following content |
This is great! After googling for a bit it seems to me that this is definitely the way to go. I have found that Mandrake, SuSE and Fedora all use a similar solution. There are two differences though. They put it in the file /etc/gtk/gtkrc.utf8. That fixes the problem for all users and doesn't cause any problems if a user isn't using a UTF-8 locale. The second difference is that they all specify different fonts.
My feeling is that the best thing to do is to copy the /etc/gtk/gtkrc.iso-8859-15 file to /etc/gtk/gtkrc.utf8 and then edit the copy, replacing every occurrence of 8859-15 in the file with 10646-1. This means that everything on your system will be the same as before but compatible with UTF-8.
Your solution looks like it would cause a problem if you went back to an iso-8859-15 locale. |
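The copy-and-edit step suggested above can be done with sed instead of by hand. This is a sketch: the function name to_unicode_fontset is made up, and the sample line is the fontset from the .gtkrc earlier in the thread with its encoding changed to iso8859-15 so there is something to replace.

```shell
# Rewrite X font names on stdin so that they ask for the Unicode
# (iso10646-1) encoding instead of iso8859-15.
to_unicode_fontset() {
    sed 's/8859-15/10646-1/g'
}

echo 'fontset = "-*-helvetica-medium-r-normal-*-12-*-*-*-*-*-iso8859-15"' |
    to_unicode_fontset
```

Run as root, to_unicode_fontset < /etc/gtk/gtkrc.iso-8859-15 > /etc/gtk/gtkrc.utf8 would produce the system-wide file in one go, leaving the original untouched for users who stay on an iso-8859-15 locale.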
AliceDiee n00b
Joined: 22 Jan 2004 Posts: 74 Location: Hattingen, Germany
|
Posted: Mon May 17, 2004 10:23 pm Post subject: |
|
|
Quote: | They put it in the file /etc/gtk/gtkrc.utf8 That fixes the problem for all users and doesn't cause any problems if a user isn't using a UTF-8 locale. |
So far so good, but since I'm using gtk-themes it isn't working for me. It seems that this file gets overridden when you customize your look and feel! |
agnitio Tux's lil' helper
Joined: 17 Apr 2004 Posts: 136
|
Posted: Fri May 21, 2004 12:33 am Post subject: |
|
|
Hmm, I don't know if I've missed some basic thing. But I followed this guide and now I can't use non-english characters in terminals, it just results in strange looking characters (as utf-8 does when displayed on a non-utf8 system). In X everything seems to work fine so far though.
Do I have to set the consolefont to something special or what am I doing wrong? |
AliceDiee n00b
Joined: 22 Jan 2004 Posts: 74 Location: Hattingen, Germany
|
Posted: Fri May 21, 2004 1:17 pm Post subject: |
|
|
Try this in your /etc/rc.conf
Code: | CONSOLEFONT="lat9u-16" |
agnitio Tux's lil' helper
Joined: 17 Apr 2004 Posts: 136
|
Posted: Fri May 21, 2004 2:13 pm Post subject: |
|
|
AliceDiee wrote: | Try this in your /etc/rc.conf
Code: | CONSOLEFONT="lat9u-16" |
|
Ah, thanks a lot! That did the trick. Now the only thing left is getting it to work in my terminal windows. I googled around a bit and found that aterm and eterm do not support utf8, is this correct? Anyway, I tried rxvt: I've tried setting it to fonts that should be supported and I've tried setting the encoding to utf8 (which started rendering Chinese characters) and I can't seem to get it to work right. Xterm seems to work right away, though on startup I get this message:
Code: |
xterm: Can't execvp "/usr/X11R6"/bin/luit: Filen eller katalogen finns inte
xterm: cannot support your locale.
|
(The Swedish part of the message means "No such file or directory".) I suppose that path is a bit weird, because "luit" is right there in /usr/X11R6/bin/luit.
Thanks for a very nice guide! |
AliceDiee n00b
Joined: 22 Jan 2004 Posts: 74 Location: Hattingen, Germany
|
Posted: Fri May 21, 2004 4:34 pm Post subject: |
|
|
Maybe you want to give mlterm (multi-language terminal) a chance, it's in portage and supports utf8 and transparency
Last edited by AliceDiee on Fri May 21, 2004 11:45 pm; edited 1 time in total |
rounin Tux's lil' helper
Joined: 13 Apr 2003 Posts: 84
|
Posted: Fri May 21, 2004 11:04 pm Post subject: |
|
|
Thanks for mentioning convmv.
Using localedef to convert locales to UTF-8 is pretty easy when you know it, but wouldn't it be better if they were generated by default? |
TecHunter Tux's lil' helper
Joined: 15 Feb 2003 Posts: 124
|
Posted: Sat May 22, 2004 12:29 pm Post subject: |
|
|
I just set my locale to UTF-8, but when I start up beep-media-player it gets SIGSEGV... does beep-media-player not have UTF-8 support? |
rounin Tux's lil' helper
Joined: 13 Apr 2003 Posts: 84
|
Posted: Sat May 22, 2004 12:33 pm Post subject: |
|
|
I've used beep-media-player in a UTF-8 locale, so that should work. (Beep does have a tendency to crash, though.)
Are you sure you have the UTF-8 locale that you're trying to use? |
vdboor Guru
Joined: 03 Dec 2003 Posts: 592 Location: The Netherlands
|
Posted: Sat May 22, 2004 2:56 pm Post subject: |
|
|
This is very cool!! Suddenly a lot of gnome applications appear to have translated messages.
Just a question, where could I set this environment variable for a user if he/she logs into KDE with KDM?
I've set the language for KDE-Applications to Dutch in the control panel, but this doesn't change the language of console applications (and GTK apps). ...how can I solve this? |
rounin Tux's lil' helper
Joined: 13 Apr 2003 Posts: 84
|
Posted: Sat May 22, 2004 3:06 pm Post subject: |
|
|
export LANG=xx_XX
export LC_ALL=xx_XX
Put it in the users' .bashrc I guess |
vdboor Guru
Joined: 03 Dec 2003 Posts: 592 Location: The Netherlands
|
Posted: Sat May 22, 2004 3:45 pm Post subject: |
|
|
The .bashrc won't be read if kde starts. (only if you start a terminal window with bash)
Currently I've hacked a little in the "startkde" script, but it's a ugly hack:
Code: | if grep -q "Language=nl" ~/.kde/share/config/kdeglobals; then
export LANG="nl_NL"
fi |
at least it works :p |
TecHunter Tux's lil' helper
Joined: 15 Feb 2003 Posts: 124
|
Posted: Sat May 22, 2004 3:46 pm Post subject: |
|
|
rounin wrote: | I've used beep-media-player in a UTF-8 locale, so that should work. (Beep does have a tendency to crash, though.)
Are you sure you have the UTF-8 locale that you're trying to use? |
this is my locale -a | grep UTF-8 output:
Code: | en_US.UTF-8
zh_CN.UTF-8 |
I'm going to use zh_CN.UTF-8
but zh_CN.UTF-8 isn't in the localedef --list-archive output... |
ecatmur Advocate
Joined: 20 Oct 2003 Posts: 3595 Location: Edinburgh
skyfolly Apprentice
Joined: 16 Jul 2003 Posts: 245 Location: Dongguan & Hong Kong, PRC
|
Posted: Fri Jul 23, 2004 3:59 am Post subject: |
|
|
I wish there were an ebuild for it. |