Gentoo Forums
Java: Methods of removing accentuated characters not working
coltson
n00b


Joined: 15 Oct 2005
Posts: 54

Posted: Wed May 15, 2013 11:02 pm    Post subject: Java: Methods of removing accentuated characters not working

Hi, I have been trying to remove accented characters from strings, replacing them with their unaccented equivalents.
The problem is that it is not working. And the problem isn't the algorithm, because I tried three different algorithms using copy and paste. Here is part of the output:
Quote:
O welington capxaba ?? um pessimo jogador de futebol.
O welington capxaba ?? um pessimo jogador de futebol.


The first line is printed before removing the accented characters, so it is working as intended (the terminal not being able to show accented characters is irrelevant). The second line is where the problem happens: it should have been converted to "O welington capxaba e um pessimo jogador de futebol". The original string is "O welington capxaba é um pessimo jogador de futebol". I think this problem may be related to a previous problem in which I wasn't able to compile the code. Here is a shortened version of the errors that I got:
Quote:
Procurador de Texto/FileManager.java:116: error: unmappable character for encoding ASCII
case '??':
^
Procurador de Texto/FileManager.java:116: error: unmappable character for encoding ASCII
case '??':
^

and
Quote:
Procurador de Texto/FileManager.java:116: error: unclosed character literal
case '??':
^
Procurador de Texto/FileManager.java:116: error: unclosed character literal
case '??':
^


This happened in lines like:
Code:
case 'ç':
These two errors happened 88 times throughout the code, so I am not going to post them all. They always occur on the accented characters used in one of the algorithms.

I got rid of them with
Quote:
./javac -encoding UTF8 Procurador\ de\ Texto/Main.java Procurador\ de\ Texto/FileManager.java


So I think these two problems are related. Any ideas?
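
For reference, here is a minimal sketch of one common way to do the replacement described above. It is not one of the three algorithms mentioned in the post (those are not shown); it assumes java.text.Normalizer is acceptable, and the class and method names are made up for illustration. The accented character is written as a Unicode escape, so the source file is pure ASCII and compiles without the -encoding flag.

```java
import java.text.Normalizer;

class AccentStripper {
    // Decompose each accented letter into its base letter followed by
    // combining marks (NFD), then delete the combining marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is written as an escape sequence,
        // so this file contains no non-ASCII bytes.
        String original = "O welington capxaba \u00e9 um pessimo jogador de futebol";
        System.out.println(stripAccents(original));
        // prints "O welington capxaba e um pessimo jogador de futebol"
    }
}
```

Note that this only handles letters that decompose into a base letter plus a mark. Characters such as 'ß', 'æ' or 'œ' have no such decomposition and would pass through unchanged.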
Voltago
Advocate


Joined: 02 Sep 2003
Posts: 2593
Location: userland

Posted: Sun May 26, 2013 10:15 pm

Coincidentally, I recently wrote something quite similar to replace certain Unicode characters with their LaTeX representation, and it worked on my system. Perhaps you can copy and paste the code below; if it produces the same errors as your own, have a look at your system locale (command "locale") to see whether the character set you are using can handle those symbols. If you haven't already, switching to a UTF-8 locale may solve your problem (there is a document about it in the official Gentoo documentation, iirc).

EDIT: Seeing as you have already tried different code snippets, and as one accented character is turned into the two characters '??', this is probably related to your chosen system locale. In UTF-8, those characters are represented by two bytes (or more), while the standard ASCII ones take one byte each. I think your conversion works correctly, but your system is not able to parse those two-byte characters and gives you two question marks instead.
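
The two-byte point can be checked with a small sketch. It assumes, as one possible explanation only, that the text is decoded somewhere with an ASCII charset; the class and method names are hypothetical. Each of the two UTF-8 bytes of 'é' then becomes its own replacement character, which a terminal typically shows as '?', and a search for 'é' in the result finds nothing.

```java
import java.nio.charset.StandardCharsets;

class TwoQuestionMarks {
    // Encode the text as UTF-8, then decode those bytes as if they were
    // ASCII, which mimics reading UTF-8 input with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // U+00E9, written as an escape to keep the source ASCII

        // One character, but two bytes in UTF-8 (0xC3 0xA9).
        System.out.println(eAcute.getBytes(StandardCharsets.UTF_8).length); // prints 2

        // Neither byte is valid ASCII, so each is replaced by U+FFFD.
        String wrong = misdecode(eAcute);
        System.out.println(wrong.length());          // prints 2
        System.out.println(wrong.indexOf('\u00e9')); // prints -1: nothing left to match
    }
}
```

If this is what is happening, no accent-removal algorithm can succeed, because the string it receives no longer contains the accented character.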

The output of the "locale" command on my system is:
Code:
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC=en_US.utf8
LC_TIME=de_DE.utf8
LC_COLLATE="en_US.utf8"
LC_MONETARY=de_DE.utf8
LC_MESSAGES=en_US.utf8
LC_PAPER=de_DE.utf8
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
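
To see what the JVM itself picked up from the locale, something like the following sketch may help (the class name is made up for illustration). On the JDKs current at the time of this thread the default charset is derived from the locale settings; on JDK 18 and later it is UTF-8 by default regardless of locale.

```java
import java.nio.charset.Charset;

class ShowDefaultCharset {
    // Report the charset the JVM uses when none is given explicitly,
    // e.g. for FileReader or new String(byte[]).
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
             + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If this reports US-ASCII or ANSI_X3.4-1968, files read without an explicit charset will lose their accented characters.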



Code:
public class EncodingTest
{   
    public static String encodeUnicodeToLatex(CharSequence str)
    {
        StringBuilder sb = new StringBuilder();
        String r;
       
        int last=-1;
        int length = str.length();
        for(int i=0; i<length; i++)
        {
            char ch = str.charAt(i);

            if(ch>127) switch(ch)
            {
                case '’':
                    r="'";
                    break;
                case '„':
                    r="``";
                    break; 
                case '“':
                    r="''";
                    break;
                case 'à':
                    r="\\`{a}";
                    break;
                case 'â':
                    r="\\^{a}";
                    break;
                case 'ä':
                    r="\\\"{a}";
                    break;
                case 'æ':
                    r="{\\ae}";
                    break; 
                case 'ç':
                    r="\\c{c}";
                    break;     
                case 'é':
                    r="\\'{e}";
                    break;
                case 'è':
                    r="\\`{e}";
                    break;
                case 'ê':
                    r="\\^{e}";
                    break;
                case 'ë':
                    r="\\\"{e}";
                    break;   
                case 'î':
                    r="\\^{\\i}";
                    break;
                case 'ï':
                    r="\\\"{\\i}";
                    break;
                case 'ô':
                    r="\\^{o}";
                    break;
                case 'ö':
                    r="\\\"{o}";   
                    break;
                case 'œ':
                    r="{\\oe}";
                    break;
                case 'ß':
                    r="{\\ss}";
                    break;     
                case 'ù':
                    r="\\`{u}";
                    break;
                case 'û':
                    r="\\^{u}";
                    break;
                case 'ü':
                    r="\\\"{u}";
                    break;
                case 'ÿ':
                    r="\\\"{y}";
                    break;
                case 'À':
                    r="\\`{A}";   
                    break;
                case 'Â':
                    r="\\^{A}";
                    break;
                case 'Ä':
                    r="\\\"{A}";
                    break;
                case 'Æ':
                    r="{\\AE}";
                    break; 
                case 'Ç':
                    r="\\c{C}";
                    break;     
                case 'É':
                    r="\\'{E}";
                    break;
                case 'È':
                    r="\\`{E}";
                    break;
                case 'Ê':
                    r="\\^{E}";
                    break;
                case 'Ë':
                    r="\\\"{E}";
                    break;   
                case 'Î':
                    r="\\^{I}";
                    break;
                case 'Ï':
                    r="\\\"{I}";
                    break;
                case 'Ô':
                    r="\\^{O}";
                    break;
                case 'Ö':
                    r="\\\"{O}";   
                    break;
                case 'Œ':
                    r="{\\OE}";
                    break;
                case 'Ù':
                    r="\\`{U}";
                    break;
                case 'Û':
                    r="\\^{U}";
                    break;
                case 'Ü':
                    r="\\\"{U}";
                    break;
                case 'Ÿ':
                    r="\\\"{Y}";
                    break;   
                default:           
                    r=null;
            }
            else
                r=null;
           
            if(r!=null)
            {
//                System.out.printf("%s, %d, %d\n", r, last+1, i);
                CharSequence substr = str.subSequence(last+1, i);
                sb.append(substr);
                sb.append(r);
                last=i;
            }                       
        }

        CharSequence substr = str.subSequence(last+1, length);
        sb.append(substr);
       
        return sb.toString();
    }
   
    public static void main(String[] args) throws Exception
    {
        String str="ùûüÿàâæçéèêëïîôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔŒäöüß’„“ÄÖÜ";
        System.out.println(encodeUnicodeToLatex(str));
    }
}