bash: convert numeric char reference in XML/HTML to UTF-8

tbart · Apprentice Joined: 31 Oct 2004 Posts: 151

Hi!

A very short yet difficult question:

Can anyone solve this with only standard bash and friends? (Script will be ported to Cygwin and I prefer not having to install perl; if awk is the only way I guess it will still be smaller than perl?)

Akkara · Posted: Fri Feb 19, 2010 11:47 am

I noticed no one has replied yet, and was really hoping someone would because the only solution I can think of seems unwieldy, which means it probably isn't the best one. But if there really is no better way, here is this one.

The basic idea is just to have a substitution rule for each character reference. E.g., s/&#171;/«/g

Except there are many possible character references, which makes this a very long and very inefficient way of doing things if you want to include them all. This really needs something like perl or python that supports arrays, or a way for sed to pipe a substitution through an external command, which, as far as I know, isn't possible.

The sheer number of character references also makes typing it all in by hand very tiresome and error-prone.

So here's a shell-script that outputs the required sed script. Run it once, stick its output in a file, and that becomes your conversion script. To make the result not quite as long, the last digit of the character-reference isn't matched in the first stage. The 10 characters corresponding to 0-9 in the last digit are returned, and then the correct one is picked out at the end.

(Also, this generates a shell-script that invokes sed. It can be easily modified to generate a sed script directly.)
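The generator script itself is not reproduced here. As a rough sketch of the idea only (not Akkara's script, and without the last-digit trick), the following bash function prints one sed rule per code point. The name gen_sed_rules is made up for illustration, and the sketch assumes it is enough to cover the two-byte UTF-8 range (code points 128 to 2047):

```bash
#!/usr/bin/env bash
# Hypothetical generator (name and interface are assumptions): prints one
# sed substitution rule per decimal character reference in [first, last].
# Only the two-byte UTF-8 range (128..2047) is handled to keep it short.
gen_sed_rules() {
    local first=$1 last=$2 cp esc ch
    if (( first < 128 || last > 2047 || first > last )); then
        echo "gen_sed_rules: range must lie within 128..2047" >&2
        return 1
    fi
    for (( cp = first; cp <= last; cp++ )); do
        # Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
        printf -v esc '\\x%02x\\x%02x' \
            $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
        # Let the printf builtin turn the \xHH escapes into raw bytes.
        printf -v ch "$esc"
        printf 's/&#%d;/%s/g\n' "$cp" "$ch"
    done
}
```

Something like "gen_sed_rules 161 255 > refs.sed" followed by "sed -f refs.sed input.xml" would then apply the rules (both file names are placeholders). One rule per reference keeps the generator trivial, at the cost of a longer sed script than the two-stage approach described above.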

tbart · Apprentice Joined: 31 Oct 2004 Posts: 151

hmm... yes. The same thing came to my mind as well but I discarded the idea the moment I thought about the size of the "translation table". Thanks for the effort nevertheless, I might still be using it to avoid having to deploy perl!

I also reconsidered my inexplicable dislike of awk and took a look at it. Not so bad, I have to say. Strange as it may seem, it can output UTF-8 characters, but, unlike perl's, its printf %c cannot take values above 128 or 255 respectively. This is what I read for gawk. So what I also thought about would be some (possibly already existing) patched version of ?awk that is able to do the same thing as perl's printf can do. cygwin's perl weighs in at a hefty >10 MB (which nearly doubles my deployment for just one line of code..) whereas gawk weighs a mere 600-something KB.

If that's not available: In terms of code readability/maintainability I think I should still prefer perl. Or port the whole script to perl so I don't need cygwin altogether (but I'm nowhere near a perl pro...)

Thanks for your input, Akkara, always appreciated (*remembers the verbose redirection stuff*)!

tbart · Apprentice Joined: 31 Oct 2004 Posts: 151

something came to my mind but I don't have the time atm to find out whether this could work out....

bash itself (!) can operate on matched patterns:

STOP. This is what I wanted to write. Seems I had to have the time :->
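The function from this post is not reproduced here. As a sketch of the matching mechanism only, assuming bash 3.1 or later, the loop below uses [[ =~ ]] and BASH_REMATCH to find each decimal reference and replace it. The name mark_refs is invented for illustration, and it substitutes a visible [U+XXXX] placeholder rather than the real character, since producing the character is the separate problem discussed further down the thread:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not the function from the post): replaces every
# decimal reference &#NNN; in a string with a [U+XXXX] placeholder,
# using only bash builtins.
mark_refs() {
    local rest=$1 out='' re='&#([0-9]+);' match code
    while [[ $rest =~ $re ]]; do
        match=${BASH_REMATCH[0]}   # whole reference, e.g. &#171;
        code=${BASH_REMATCH[1]}    # digits only, e.g. 171
        # Keep the text before the first occurrence of the match.
        out+=${rest%%"$match"*}
        # 10# forces base 10 so leading zeros are not read as octal.
        printf -v out '%s[U+%04X]' "$out" "$(( 10#$code ))"
        # Continue with whatever follows the match.
        rest=${rest#*"$match"}
    done
    printf '%s\n' "$out$rest"
}
```

Keeping the regex in a variable avoids quoting surprises inside [[ ]], and the whole loop runs without forking a single process.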

Mike Hunt · Watchman Joined: 19 Jul 2009 Posts: 5287

What about the /usr/bin/recode command? See: man recode
Could that be a possible workable option?
I never used it, so not sure if it would make sense here.

tbart · Apprentice Joined: 31 Oct 2004 Posts: 151

well that was a nice tip.
I am currently not at a cygwin station, but on my linux box recode and my self-built function are equally fast (yay!)

recode (app-text/recode if anyone wonders) may be faster on cygwin I guess because I would only have to fork-exec once and not once for every character.

will check that. thanks for the input!

Mike Hunt · Watchman Joined: 19 Jul 2009 Posts: 5287

Cool that recode does it, wasn't sure - just a hunch really.

tbart · Apprentice Joined: 31 Oct 2004 Posts: 151

well I shouldn't post after a hard day of working...

my actual problem now - as mentioned in the post containing the bash solution - is heavy fork-exec load under cygwin. I am calling this function a few thousand times only to know whether a file in an XML exists (and to skip processing then).
Be it perl, sed, awk, recode, iconv whatever: The fork-exec needed to call this external utility takes around one second (!) on a 4x2.83GHz Core2Quad. So a few thousand seconds just to skip files. This is boooooring.

That's why I am searching for a solution right inside bash.
(granted, recode is the cleanest solution and I will definitely use it for such things in the future instead of perl scripts that handle each exception (&amp; and the like) separately and are error-prone (maybe there are more special characters I forgot?))

My script posted above is as fast as perl and recode on my linux box. But on cygwin, I fear, the fork-exec still takes place for /usr/bin/printf, so it won't be any faster. Is there any bash-internal way of printing the UTF-8 character for a given code point, like /usr/bin/printf \u does?

If I find that, the whole conversion will take place right inside bash, so no context switch and no associated waiting.
And no additional programs to deploy with my cygwin installation (well that's already fulfilled with my script...)

ideas?
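One possible direction, offered as a sketch rather than as the solution that was eventually chosen: compute the UTF-8 bytes with shell arithmetic and let the printf builtin emit them through \xHH escapes, so nothing is forked. The function name utf8_char and its calling convention are assumptions made for this example:

```bash
#!/usr/bin/env bash
# Hypothetical helper: stores the UTF-8 encoding of the decimal code point $2
# in the variable named by $1, using only the printf builtin (no fork).
# Code point 0 cannot be stored because bash variables cannot hold NUL.
utf8_char() {
    local __cp=$(( 10#$2 )) __esc
    if (( __cp < 0x80 )); then                       # 1 byte: 0xxxxxxx
        printf -v __esc '\\x%02x' "$__cp"
    elif (( __cp < 0x800 )); then                    # 2 bytes
        printf -v __esc '\\x%02x\\x%02x' \
            $(( 0xC0 | (__cp >> 6) )) $(( 0x80 | (__cp & 0x3F) ))
    elif (( __cp < 0x10000 )); then                  # 3 bytes
        printf -v __esc '\\x%02x\\x%02x\\x%02x' \
            $(( 0xE0 | (__cp >> 12) )) \
            $(( 0x80 | ((__cp >> 6) & 0x3F) )) $(( 0x80 | (__cp & 0x3F) ))
    elif (( __cp <= 0x10FFFF )); then                # 4 bytes
        printf -v __esc '\\x%02x\\x%02x\\x%02x\\x%02x' \
            $(( 0xF0 | (__cp >> 18) )) $(( 0x80 | ((__cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((__cp >> 6) & 0x3F) )) $(( 0x80 | (__cp & 0x3F) ))
    else
        return 1                                     # outside Unicode
    fi
    # Expand the \xHH escapes into raw bytes and assign them to the target.
    printf -v "$1" "$__esc"
}
```

A call such as "utf8_char ch 171" would leave « in $ch. Returning the result through printf -v instead of command substitution matters here, because $(...) would itself fork a subshell.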

tbart · Apprentice Joined: 31 Oct 2004 Posts: 151

...and the winner is: