Gentoo Forums
Help setting up UTF-8 filename encoding
eliszka
n00b


Joined: 13 Nov 2007
Posts: 12
Location: Colorado, USA

Posted: Mon Mar 09, 2015 1:36 am    Post subject: Help setting up UTF-8 filename encoding

Right away I'll admit that I have very little experience with setting or changing system encoding settings. Up until now, if a filename was read/generated correctly, great; if not, I dealt with it. I do have a long history with Gentoo on my desktops and laptops.
Now I've built a new Gentoo system and would like it to work right. I've followed the instructions in the Gentoo wiki for setting up the locale and I've read the UTF-8 wiki. Really what I want this for is to be able to read English/French/German/Spanish filenames correctly, with the accents and umlauts and what have you. I think a lot of my existing files are in some other encoding, but I'm not sure which one, or how to make all of my apps (xterm/firefox/amarok etc.) read the filenames correctly.
Right now the only application that is reading French correctly is XFE, but it's giving me a warning when I look at the file properties:
=> Warning: file name is not UTF-8 encoded!

I'd also like to know how to detect what encoding is used so I can convert it to UTF-8. I understand that I may be able to use convmv to do this.

Here is what I can think of to provide; if you are willing to help, please let me know what other info I need to give:
Code:

eselect locale list
Available targets for the LANG variable:
  [1]   C
  [2]   de_DE
  [3]   de_DE@euro
  [4]   de_DE.iso88591
  [5]   de_DE.iso885915@euro
  [6]   deutsch
  [7]   en_US
  [8]   en_US.iso88591
  [9]   en_US.utf8 *
  [10]  fran�ais
  [11]  french
  [12]  fr_FR
  [13]  fr_FR@euro
  [14]  fr_FR.iso88591
  [15]  fr_FR.iso885915@euro
  [16]  german
  [17]  POSIX
  [ ]   (free form)

/etc/portage/make.conf:
Code:

LINGUAS="en de fr"

Code:

echo $LANG
en_US.utf8

xterm example:
Code:

01-R?sum?.mp3

those are supposed to be e's with accents.
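
Since `$LANG` alone does not show what `LC_ALL` or `LC_CTYPE` may override, it can help to check the charmap the session actually uses. A minimal sketch, assuming the `locale` utility is available; `is_utf8_charmap` is a made-up helper name, not something from the wiki:

```shell
# Hypothetical helper: succeeds when the given charmap name denotes UTF-8.
is_utf8_charmap() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# `locale charmap` reports the character encoding of the effective locale,
# after LC_ALL / LC_CTYPE / LANG have all been taken into account.
charmap=$(locale charmap 2>/dev/null)

if is_utf8_charmap "$charmap"; then
    echo "session charmap is UTF-8"
else
    echo "session charmap is '$charmap', not UTF-8"
fi
```

If this reports UTF-8 and names still show up as `?`, the names themselves are the more likely culprit than the environment.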

Thanks for reading, any help is appreciated.
VoidMage
Watchman


Joined: 14 Oct 2006
Posts: 6193

Posted: Mon Mar 09, 2015 10:15 am

Well, that could be a painful problem, as it's not so much a matter of the current configuration as of all the previous ones.
On ext2fs (and its successors, and most likely all other standard Linux filesystems), filenames are stored as char arrays in whatever encoding the current locale used when they were created. If you switch to a locale that uses a different encoding, there is no automatic recoding. Windows filesystems (especially the VFAT family) have their own set of problems (significantly different, and in most cases easier to solve, but not always). Also, until recently zip didn't store filenames as UTF8/UTF16, so most of those archives floating around face a similar problem.

If you squint enough, 'LC_ALL=C ls -b' should give you decent hints about the encoding of any particular filename. app-text/convmv should help with batch converting, though you shouldn't rely on its autodetection feature - if you know where the file came from, you should have a good idea which encoding was used.
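
To illustrate what those hints look like, here is a small sketch that works in a scratch directory, so no real files are touched. The two filenames are invented for the demonstration; one is written with a Latin-1 byte, the other with the UTF-8 bytes for the same character:

```shell
# Work in a throwaway directory.
tmpdir=$(mktemp -d)
cd "$tmpdir"

# e-acute as a single Latin-1 byte (0xE9, octal 351).
touch "$(printf 'latin1-R\351sum\351.mp3')"
# e-acute as the two-byte UTF-8 sequence (0xC3 0xA9, octal 303 251).
touch "$(printf 'utf8-R\303\251sum\303\251.mp3')"

# In the C locale every byte outside ASCII is printed as an octal escape,
# so the raw bytes of each name become visible.
listing=$(LC_ALL=C ls -b)
printf '%s\n' "$listing"
```

A lone `\351` per accented letter points at an 8-bit charmap, while a `\303\251` pair points at UTF-8.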
krinn
Watchman


Joined: 02 May 2003
Posts: 6960

Posted: Mon Mar 09, 2015 10:37 am

Code:
>locale -a | grep fr
fr_FR.utf8
>localedef -i fr_FR -f UTF-8 fr_FR.UTF-8
>echo "fr_FR.UTF-8 UTF-8" >> /etc/locale.gen
>locale-gen
>echo "LANG=\"fr_FR.UTF-8\"" > /etc/env.d/02locale
>env-update && source /etc/profile

If you have a problem, reread the wiki, as all of this should already be in it.

encoding type?
Code:
file delay.txt
delay.txt: UTF-8 Unicode text
file overflow.txt
overflow.txt: ASCII text, with very long lines
eliszka
n00b


Joined: 13 Nov 2007
Posts: 12
Location: Colorado, USA

Posted: Mon Mar 09, 2015 11:40 am

Thank you both for your replies. I don't have a lot of time to look at this now, but a few quick things:
`LC_ALL=C ls -b` output for me is no different from regular ls -b output.
The `file` command gives me the encoding of the file contents, but what I'm more concerned with here is the encoding of the filename.

My current locale is en_US.utf8. I used the `rip` command to rip a CD with French song titles. It seems to have encoded the names correctly, because as stated before XFE displays the names correctly. I can't get other apps (xterm, firefox etc.) to do the same. I even followed the rxvt-unicode wiki yesterday (http://wiki.gentoo.org/wiki/Rxvt-unicode) and that doesn't display the characters correctly either.
So I understand the problem with filenames generated under different encodings/other filesystems. My problem seems to be getting all of my current apps to behave correctly?

I will re-read the wiki :)
VoidMage
Watchman


Joined: 14 Oct 2006
Posts: 6193

Posted: Mon Mar 09, 2015 7:15 pm

@krinn: whatever the problem you're trying to solve, it's not the one from the question.

Under unmodified settings, `LC_ALL=C ls -b` should print any character outside of ASCII as an octal escape - that should give you a hint whether the name is utf8 encoded or in one of the 8bit charmaps. Usually, for French/German that 8bit charmap will be either latin1 or cp1252.
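
One way to turn that hint into a mechanical check is to ask iconv whether the bytes of a name form valid UTF-8. A minimal sketch, assuming iconv is installed; `is_utf8_name` is a made-up helper name, and a failed check only says "not UTF-8", it cannot tell latin1 from cp1252:

```shell
# Hypothetical helper: succeeds when the argument is a valid UTF-8 byte
# sequence (plain ASCII counts as valid UTF-8 too).
is_utf8_name() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Two spellings of the same name: UTF-8 bytes first, Latin-1 bytes second.
for name in "$(printf 'R\303\251sum\303\251.mp3')" "$(printf 'R\351sum\351.mp3')"; do
    if is_utf8_name "$name"; then
        echo "valid UTF-8"
    else
        echo "not UTF-8 (probably an 8-bit charmap such as latin1 or cp1252)"
    fi
done
```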
eliszka
n00b


Joined: 13 Nov 2007
Posts: 12
Location: Colorado, USA

Posted: Tue Mar 10, 2015 1:10 am

VoidMage wrote:
Under unmodified settings, `LC_ALL=C ls -b` should print any character outside of ASCII as an octal escape - that should give you a hint whether the name is utf8 encoded or in one of the 8bit charmaps. Usually, for French/German that 8bit charmap will be either latin1 or cp1252.


Yes, you are absolutely right. In my previous reply I was running ls -b on the wrong file, sorry. So here's what I have:
Code:
01-R\351sum\351.mp3

(the \351 should be an e with an accent) so from what I can tell this is Latin 1.

So now I go back to the `convmv` command and do this:
Code:

$ convmv --notest -f ISO-8859-1 -t UTF-8  01-R�sum�.mp3
mv "./01-R�sum�.mp3"    "./01-Résumé.mp3"
Ready!
$ ls 01*
01-Résumé.mp3


So it seems that my environment is fine; it's the filenames, possibly in various encodings, that are my problem... or maybe they're all in Latin 1.
I guess my mistake was assuming that UTF-8 would just pick up the other encoding. This character is the byte E9 in Latin 1 and its Unicode code point is also U+00E9, but UTF-8 encodes it as the two bytes C3 A9, so that's a bit confusing.
But it definitely doesn't work like I had assumed, and that assumption was based on some vague recollection.
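
The byte-level difference can be checked directly. A minimal sketch, assuming od and iconv are available:

```shell
# Latin-1 stores e-acute as the single byte E9 (octal 351).
latin1_hex=$(printf '\351' | od -An -tx1 | tr -d ' \n')

# Converting that one byte from Latin-1 to UTF-8 yields two bytes.
utf8_hex=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "Latin-1: $latin1_hex"
echo "UTF-8:   $utf8_hex"
```

This prints `e9` for Latin-1 and `c3a9` for UTF-8, which is why a UTF-8 application cannot simply display the old name as-is.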
Thank you for helping me out here, if you have any other tips and tricks for dealing with filename encoding I'd like to see them.
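
For a whole directory, the same rename can be scripted. The following is only a sketch of the idea, not what convmv does internally: it assumes iconv is installed and that every non-UTF-8 name really is Latin 1, and `latin1_names_to_utf8` is a made-up helper name. It runs against a scratch directory with two invented files:

```shell
# Hypothetical helper: rename every name in a directory from Latin-1 to
# UTF-8, leaving names that are already valid UTF-8 untouched.
latin1_names_to_utf8() {
    dir=$1
    for path in "$dir"/*; do
        [ -e "$path" ] || continue
        name=${path##*/}
        # Already valid UTF-8 (this includes plain ASCII): nothing to do.
        if printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            continue
        fi
        # Re-encode only the name; the file contents are not touched.
        new=$(printf '%s' "$name" | iconv -f ISO-8859-1 -t UTF-8) || continue
        mv -- "$path" "$dir/$new"
    done
}

# Demonstration in a throwaway directory: one Latin-1 name, one UTF-8 name.
scratch=$(mktemp -d)
touch "$scratch/$(printf '01-R\351sum\351.mp3')"
touch "$scratch/$(printf '02-D\303\251j\303\240.mp3')"

latin1_names_to_utf8 "$scratch"
LC_ALL=C ls -b "$scratch"
```

On real data convmv is the safer choice, since without `--notest` it only prints what it would do.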
charles17
Advocate


Joined: 02 Mar 2008
Posts: 2583

Posted: Tue Mar 10, 2015 8:09 am

eliszka wrote:
if you have any other tips and tricks for dealing with filename encoding I'd like to see them.
For checking different encodings of contents and filenames you could use Firefox. From the menu => View => Character Encoding, select how to display them.
With the correct encoding chosen, Firefox should display filenames correctly, e.g.:
Quote:
$ cd /tmp
$ touch çê₵öÄļ
All times are GMT