Gentoo Forums
bash: convert numeric char reference in XML/HTML to UTF-8
tbart
Apprentice


Joined: 31 Oct 2004
Posts: 151

Posted: Thu Feb 18, 2010 10:21 pm    Post subject: bash: convert numeric char reference in XML/HTML to UTF-8

Hi!

A very short yet difficult question:

Can anyone solve this with only standard bash and friends? (The script will be ported to Cygwin and I would prefer not to install perl; if awk is the only way, I guess it would still be smaller than perl?)

Code:
echo "xml_or_html_encoded_chars_& #246;& #273;_that_want_to_be_converted_with_bash_and_tools_preferrably_without_awk" | sed -e 's# ##g' | perl -C6 -pe 's/&#([^;]*);/sprintf("%s", chr($1))/ge;'


(Remove the blanks and the sed part if you like; the forum would not let me post the character references as they are because it converted them...)

chr in bash would be something like this:
Code:
#!/bin/bash
printf -v chr '\\0%o' "$1"
echo -e "$chr"

but that only works for single-byte characters. And I still don't know how to substitute the result into a string.

Converting a single encoded character is not too hard:

Code:
iconv -f UCS-2BE -t UTF-8
(could be used, but we would need some conversion to hex first...)
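For what it's worth, the iconv route could look roughly like the sketch below. It skips the hex step and writes the two raw UCS-2BE bytes directly with octal escapes. The function name cp_to_utf8_iconv is made up for this example, and it assumes an iconv binary is installed and that the code point fits in 16 bits (0 to 65535, no surrogates).

```bash
#!/bin/bash
# Sketch only: cp_to_utf8_iconv is a hypothetical helper, not from the thread.
# Convert one decimal code point (0..65535) to UTF-8 by handing iconv
# the two raw bytes of its UCS-2BE representation.
cp_to_utf8_iconv() {
    local cp=$1 hi lo
    # Build octal escapes such as \000 and \366 for the high and low byte.
    printf -v hi '\\%03o' $(( (cp >> 8) & 0xFF ))
    printf -v lo '\\%03o' $(( cp & 0xFF ))
    # The builtin printf expands the escapes into raw bytes for iconv.
    printf "$hi$lo" | iconv -f UCS-2BE -t UTF-8
}

cp_to_utf8_iconv 246   # the "o with diaeresis" from the example string
echo
```

This still runs one external process per character, so it does not help with the substitution problem itself.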

This one works ok:
Code:
/usr/bin/printf "\u$(printf %04X "$(echo "& #246;" | sed -e 's:& #\([^;]*\);:\1:')")"

(Note the /usr/bin/printf: the bash built-in cannot cope with \u. Again, the blank is only there to prevent phpBB from converting the char reference.)

BUT: How do I do this several times in a string as a substitution (just like perl does it)?
Akkara
Bodhisattva


Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

Posted: Fri Feb 19, 2010 11:47 am

I noticed no one has replied yet, and was really hoping someone would because the only solution I can think of seems unwieldy, which means it probably isn't the best one. But if there really is no better way, here is this one.

The basic idea is just to have a substitution rule for each character-reference. E.g., s/&#171;/«/g

Except there are many possible character-references, which makes this a very long and very inefficient way of doing things if you want to include them all. This really needs something like perl or python that supports arrays, or for sed to pipe a substitution through an external command, which, as far as I know, isn't possible.

The great number of character-references also makes typing it all in by hand very tiresome and error-prone.

So here's a shell-script that outputs the required sed script. Run it once, stick its output in a file, and that becomes your conversion script. To make the result not quite as long, the last digit of the character-reference isn't matched in the first stage. The 10 characters corresponding to 0-9 in the last digit are returned, and then the correct one is picked out at the end.

(Also, this generates a shell-script that invokes sed. It can be easily modified to generate a sed script directly.)

Code:
#!/bin/bash
# (bash rather than sh: the for (( ... )) loops below are a bash feature)
echo "sed   \\"
for (( i=1; i < 6554; ++i )); do
    hexlist=`printf "%04X %04X %04X %04X %04X %04X %04X %04X %04X %04X" $i"0" $i"1" $i"2" $i"3" $i"4" $i"5" $i"6" $i"7" $i"8" $i"9"`
    chars=`for hex in $hexlist; do
        /usr/bin/printf "\u$hex" 2>/dev/null || echo -n '?'
    done`
    if [ "$chars" \!= "??????????" ]; then
        echo "-e 's:&#$i\([0-9]\);:\&##\1$chars;:g'   \\"
    fi
done

p=""
q="........."
for (( i=0; i < 10; ++i )); do
    echo "-e 's:&##$i$p\(.\)$q;:\1:g'   \\"
    p=$p.
    q=`echo "$q" | sed 's:^.::'`
done
echo


The 6554 above runs over 65536 (and a few more) character-references. This can be reduced (and the corresponding script shortened) if you don't need to translate all of them.
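To make the two-stage trick easier to follow, here is a hand-written sketch for a single block of ten references, 60 to 69, chosen because those ten characters are plain ASCII and none of them is special in a sed replacement. The names stage1, stage2 and decode_block_6x are invented for this example; the generator above would emit one stage-1 rule per block instead.

```bash
#!/bin/bash
# Stage 1, block 60..69 only: keep the last digit and append the ten
# candidate characters, e.g. "&#65;" becomes "&##5<=>?@ABCDE;".
stage1='s:&#6\([0-9]\);:\&##\1<=>?@ABCDE;:g'

# Stage 2: for last digit d, skip d candidates, keep one, drop the other 9-d.
stage2=()
p="" q="........."
for (( d = 0; d < 10; d++ )); do
    stage2+=( -e "s:&##$d$p\\(.\\)$q;:\\1:g" )
    p+=.          # one more candidate to skip next time
    q=${q%.}      # one fewer candidate to drop next time
done

# Hypothetical wrapper: decode references 60..69 on stdin.
decode_block_6x() {
    sed -e "$stage1" "${stage2[@]}"
}

echo 'H&#65;L 9000 &#60;&#61;&#62;' | decode_block_6x
```

References outside the block pass through unchanged, which is why the full table needs one stage-1 rule for every block of ten.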
tbart
Apprentice


Joined: 31 Oct 2004
Posts: 151

Posted: Mon Feb 22, 2010 10:42 am

hmm... yes. The same thing came to my mind as well, but I discarded the idea the moment I thought about the size of the "translation table". Thanks for the effort nevertheless; I might still use it to avoid having to deploy perl!

I also reconsidered my inexplicable dislike of awk and took a look at it. Not so bad, I have to say. Strange as it may seem, it can output UTF-8 characters, but, unlike perl's, its printf %c cannot take values above 128 or 255 respectively. This is what I read for gawk. So I also thought about some (possibly already existing) patched awk variant that can do what perl's printf does. Cygwin's perl weighs in at a hefty >10 MB (which nearly doubles my deployment for just one line of code), whereas gawk is a mere 600-something KB.

If that's not available: in terms of code readability and maintainability I think I would still prefer perl. Or port the whole script to perl so I don't need Cygwin at all (but I'm nowhere near a perl pro...)

Thanks for your input, Akkara, always appreciated (*remembers the verbose redirection stuff*)!
tbart
Apprentice


Joined: 31 Oct 2004
Posts: 151

Posted: Fri Mar 19, 2010 3:38 pm

something came to my mind but I don't have the time atm to find out whether this could work out....

bash itself (!) can operate on matched patterns:

STOP. This is what I wanted to write. Seems I had to have the time :->

Code:
#!/bin/bash

# htmltoutf8 (nearly) in bash only

stringtoconvert="& #214;sterreichNieder& #246;sterreich"
convertedstring=""
for (( i=0; i<${#stringtoconvert}; i++ ))
do
   # look at a 6-character window, enough for "&#NNN;" (numbers 0 to 999 for the moment)
   if [[ ${stringtoconvert:$i:6} =~ (.*)\&#([^\;]*)\;(.*) ]]
   then
      # convert the number with the external printf and append the result,
      # together with whatever else was inside the window
      convertedstring="${convertedstring}${BASH_REMATCH[1]}$(/usr/bin/printf "\u$(printf %04X "${BASH_REMATCH[2]}")")${BASH_REMATCH[3]}"
      # skip the rest of the window
      ((i+=5))
   else
      # just copy over original string
      convertedstring="${convertedstring}${stringtoconvert:$i:1}"
   fi
done

echo "$convertedstring"


(As always, remove the blanks between ampersand and hash)


Nice, eh?
Now if /usr/bin/printf could be replaced by some bash built-in it would be even faster on cygwin, as heavy fork-execs are damn slow there...
And then I need the reverse thing, but that won't be a problem I guess...
Mike Hunt
Watchman


Joined: 19 Jul 2009
Posts: 5287

Posted: Fri Mar 19, 2010 7:16 pm

What about the /usr/bin/recode command? See: man recode
Could that be a possible workable option?
I never used it, so not sure if it would make sense here.
tbart
Apprentice


Joined: 31 Oct 2004
Posts: 151

Posted: Sun Mar 21, 2010 10:44 pm

well that was a nice tip.
I am currently not at a cygwin station, but on my linux box recode and my self-built function are equally fast (yay!)

recode (app-text/recode if anyone wonders) may be faster on cygwin I guess because I would only have to fork-exec once and not once for every character.

will check that. thanks for the input!
Mike Hunt
Watchman


Joined: 19 Jul 2009
Posts: 5287

Posted: Sun Mar 21, 2010 11:12 pm

Cool that recode does it, wasn't sure - just a hunch really. :)
tbart
Apprentice


Joined: 31 Oct 2004
Posts: 151

Posted: Mon Mar 22, 2010 10:08 am

well I shouldn't post after a hard day of working...

my actual problem now, as mentioned in the post containing the bash solution, is the heavy fork-exec load under cygwin. I am calling this function a few thousand times just to find out whether a file named in an XML document exists (and to skip processing in that case).
Be it perl, sed, awk, recode, iconv, whatever: the fork-exec needed to call the external utility takes around one second (!) on a 4x2.83GHz Core2Quad. So that is a few thousand seconds just to skip files. This is boooooring.

That's why I am searching for a solution right inside bash.
(granted, recode is the cleanest solution and I will definitely use it for such things in the future instead of perl scripts that handle each exception (& amp; and the like) separately and are error-prone (maybe there are more special characters I forgot?) )

My script posted above is as fast as perl and recode on my linux box. But on cygwin, I fear, the fork-exec still takes place for /usr/bin/printf, so it won't be any faster. Is there any bash-internal way of printing the character that belongs to a Unicode code point, like /usr/bin/printf \u does?

If I find that, the whole conversion will take place right inside bash, so no context switch and no associated waiting.
And no additional programs to deploy with my cygwin installation (well that's already fulfilled with my script...)

ideas?
tbart
Apprentice


Joined: 31 Oct 2004
Posts: 151

Posted: Mon Mar 22, 2010 3:20 pm

...and the winner is:

Code:
#!/bin/bash
read -p "please enter decimal codepoint to convert: " deccp

# /usr/bin/printf is external, we want everything to be internal because cygwin is slow on fork-exec
# convert decimal unicode codepoint representation to octal utf-8 representation

# deccp - decimal codepoint
# octcp - octal codepoint
# utf8val - utf8 encoded multi-byte character encoded in multiple octal values

if [[ $deccp -lt 128 ]] # ASCII, i.e. 1 byte UTF8 character (0 to 127)
then
   # printf -v avoids the subshell that $(...) would need
   printf -v octcp %o $deccp
   printf "\\$octcp"
# 2 byte UTF8 character
# first byte: binary 110....., octal \3..
# second byte: binary 10......, octal \2..
elif [[ $deccp -ge 128 ]] && [[ $deccp -lt 2048 ]]
then
   printf -v octcp %o $deccp
   [[ ${#octcp} == 3 ]] && utf8val="\30${octcp:0:1}\2${octcp:1:2}" || utf8val="\3${octcp:0:2}\2${octcp:2:2}"
   printf "$utf8val"
#elif [[ $deccp -ge 2048 && $deccp -le 65535 ]]
#then
#elif [[ $deccp -ge 65536 && $deccp -le 1114111 ]]
#then
fi


brutal, but bash only! can be extended to more than 2 byte wide chars, but I don't need that atm.
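For anyone who needs the wider ranges and the substitution over a whole string, here is a rough sketch that does the same thing with bit arithmetic instead of slicing octal digits. It is not the code from the posts above: the function names cp_to_utf8 and decode_refs are invented, results are passed back in REPLY so that no subshell is needed, and it assumes decimal references only. It does no validation (surrogates, values above 1114111 and &#0; are not handled).

```bash
#!/bin/bash
# Sketch only; cp_to_utf8 and decode_refs are hypothetical names.

# Store the UTF-8 encoding of decimal code point $1 in REPLY (builtins only).
cp_to_utf8() {
    local cp=$1 esc
    if (( cp < 0x80 )); then          # 1 byte: 0xxxxxxx
        printf -v esc '\\%03o' "$cp"
    elif (( cp < 0x800 )); then       # 2 bytes: 110xxxxx 10xxxxxx
        printf -v esc '\\%03o\\%03o' \
            $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
    elif (( cp < 0x10000 )); then     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf -v esc '\\%03o\\%03o\\%03o' \
            $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else                              # 4 bytes: 11110xxx and three 10xxxxxx
        printf -v esc '\\%03o\\%03o\\%03o\\%03o' \
            $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
    fi
    printf -v REPLY "$esc"            # expand the octal escapes into bytes
}

# Replace every decimal reference in $1 and store the result in REPLY.
decode_refs() {
    local s=$1 out='' tail re='(.*)&#([0-9]+);(.*)'
    # The greedy (.*) makes each match the LAST reference, so go right to left.
    while [[ $s =~ $re ]]; do
        s=${BASH_REMATCH[1]}
        tail=${BASH_REMATCH[3]}
        # 10# forces base 10 so a leading zero is not read as octal
        cp_to_utf8 $(( 10#${BASH_REMATCH[2]} ))
        out=$REPLY$tail$out
    done
    REPLY=$s$out
}

decode_refs '&#214;sterreich Nieder&#246;sterreich'
echo "$REPLY"
```

Both functions stay inside the shell (printf -v, arithmetic expansion and [[ =~ ]] are builtins), which is the point of the exercise on cygwin.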