[SOLVED] sed matching any ascii code/hex byte

Post by tbart » Tue Jun 09, 2009 4:05 pm

Now this is a tough nut to crack...

Sed is perfect. I love it more every day. It can even operate on binary files! I use it to substitute arbitrary byte sequences with others.. but:

Try this:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33"
easy, we output

Code: Select all

123<some unprintable character>123
suppose we want to substitute that character. easy:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#\xaa#SUBST#'
outputs

Code: Select all

123SUBST123
OK. But how do you substitute, say, 3<the unprintable character>123 with SUBST if you do not know it's \xaa? (It could be any unprintable char...)

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
should do it, you say. Nope. This outputs

Code: Select all

12SUBST<the unprintable char>123
So it seems that . only matches printable characters, contrary to what the man page says (any character).

Here goes the question:
What is the correct way to address "any byte value" or "any printable or non-printable value from 0x00 to 0xFF"?

(no, [\x00-\xff] also does not work..)

This has to be possible with sed, it can match any byte value as shown above with \xaa. But how do you address ALL of them?

Thanks a lot in advance!

th
Last edited by tbart on Mon Jun 29, 2009 11:14 am, edited 1 time in total.

Post by jw5801 » Wed Jun 10, 2009 3:47 am

The problem is your regexp. '.*' matches with any number of characters, including zero! So since sed will match the first result it finds, it matches with the 3 by itself, not including anything after it.

Instead, try this:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.\{1,\}#SUBST#'
`\{1\}' means exactly one repetition of whatever comes before it. (In sed's default basic regular expressions the braces must be escaped as \{ and \} to act as a repetition count; unescaped braces are literal characters. Inside single quotes bash leaves them alone either way.) Similarly, \{1,3\} means one to three repetitions. The range doesn't have to be bounded either: \{1,\} matches one or more, and GNU sed also accepts \{,3\} for any number up to 3. An example of this is the regexp `[0-9]\{1,3\}', which will match any decimal number one to three digits long. Therefore `3.\{1\}' matches the 3 plus exactly one following character (the 3ª) and produces `12SUBST123', whereas the `3.\{1,\}' above runs on to the end of the line.

An excellent regexp resource can be found here.

EDIT: Actually, rereading part I was answering, you want 123ª123 to turn into 12SUBST. Your sed expression should do this (and does on my machine), sed is `greedy' and will match with the longest expression it can. If it's not doing that you can explicitly state it as `3.{0,}' (remember to escape the braces \{\}) which is exactly what the `*' operator performs.
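
The leftmost-longest behaviour is easy to see with plain ASCII input, where the locale plays no part. A minimal sketch (the variable names are only for illustration):

```shell
# ".*" may match zero characters, but sed takes the longest match that
# starts at the leftmost possible position: from the first 3 to end of line.
greedy=$(echo '123x123' | sed 's#3.*#SUBST#')

# "3.\{1\}" is the first 3 plus exactly one more character.
one_char=$(echo '123x123' | sed 's#3.\{1\}#SUBST#')

echo "$greedy $one_char"
```

This should print `12SUBST 12SUBST123`, so a different result on the \xaa input points at the byte itself rather than at the quantifier.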
Last edited by jw5801 on Wed Jun 10, 2009 4:42 am, edited 1 time in total.

Post by jw5801 » Wed Jun 10, 2009 4:33 am

Sorry, I should answer the actual question, shouldn't I?

[\x00-\xFF] is along the right track. However it just includes every possible byte value. I'm assuming from your description that you want to find anything that isn't a regular character. The regular characters finish up at 0x7F, so if you want to match anything after that (extended ASCII), then use the regexp: `[\x80-\xff]+'. The `+' matches with one or more of the previous token, performing the same operation as `[\x80-\xff]{1,}' (don't forget to escape the braces if you're using them though: \{\}).

Post by truc » Wed Jun 10, 2009 5:54 am

jw5801 wrote:The problem is your regexp. '.*' matches with any number of characters, including zero!
Yeah but as you said in your EDIT
sed is `greedy'
, and so the longest expression matching this regex is from the first 3 to the end of the line.
So since sed will match the first result it finds, it matches with the 3 by itself, not including anything after it.
so this should be wrong
...
I was about to ask if you really tried this, but
EDIT: Actually, rereading part I was answering, you want 123ª123 to turn into 12SUBST. Your sed expression should do this (and does on my machine), sed is `greedy' and will match with the longest expression it can. If it's not doing that you can explicitly state it as `3.{0,}' (remember to escape the braces \{\}) which is exactly what the `*' operator performs.
It looks like you did, so next question is, which sed are you using? I can't replicate this behaviour with gnused-4.2

Sorry, I should answer the actual question, shouldn't I?

[\x00-\xFF] is along the right track. However it just includes every possible byte value. I'm assuming from your description that you want to find anything that isn't a regular character.
Here we go again: does this [\x00-\xFF] actually work for you?

Code: Select all

sed 's/[\x00-\xFF]*//' <<< ''
sed: -e expression #1, char 16: Invalid collation character

Anyway, since we were talking about non-printable characters, I tried this

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed  's/[^[:print:]]/SUBS/'
123�123
But as you can see, it also didn't work :S


I've found that bbe is probably the tool you'll need, unless we find something new with sed :)
The End of the Internet!

Post by Akkara » Wed Jun 10, 2009 6:40 am

tbart wrote:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33"
It might have to do with your locale. If you are using a utf-8 encoding, '\xaa' is an invalid character. The first byte of a utf-8 character is never of the form '10xxxxxx'.

Sed reads full UTF-8 characters, which may span anywhere from 1 to 4 bytes, and treats those bytes as a unit before trying to match things. To get a byte-by-byte substitution, it might work to use a locale that only has 8-bit characters (but I don't know; I didn't try it).

Try this:

Code: Select all

echo -e "\x31\x32\x33\xc3\xb7\x31\x32\x33" | sed 's:3\(.\):3<\1>:'
The two bytes, '\xc3\xb7' represent the division character in utf-8, which sed treats as the single character it is, and matches it to '.'
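
The locale effect can be checked without relying on what the terminal renders. The following is a minimal sketch, assuming od and tr are available; the variable name is made up, and printf with octal escapes (\303\267 are the bytes 0xc3 0xb7) stands in for echo -e so it does not depend on bash. With LC_ALL=C on the sed side, every byte counts as one character, so '.' captures only the first byte of the two-byte sequence.

```shell
# Hypothetical demo: force byte-wise matching by running sed in the C locale.
# \303\267 are the two bytes 0xc3 0xb7 (UTF-8 for the division sign).
bytewise=$(printf '123\303\267123\n' \
  | LC_ALL=C sed 's:3\(.\):3<\1>:' \
  | od -An -tx1 | tr -d ' \n')

# Print the result as hex so the terminal encoding cannot hide anything.
echo "$bytewise"
```

In the C locale this should print `3132333cc33eb73132330a`: the `<` (3c) and `>` (3e) enclose only c3, and b7 is left outside. In a UTF-8 locale both bytes would be expected inside the brackets.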

Post by jw5801 » Wed Jun 10, 2009 7:51 am

truc wrote:
jw5801 wrote:The problem is your regexp. '.*' matches with any number of characters, including zero!
Yeah but as you said in your EDIT
sed is `greedy'
, and so the longest expression matching this regex is from the first 3 to the end of the line.
So since sed will match the first result it finds, it matches with the 3 by itself, not including anything after it.
so this should be wrong
...
I was about to ask if you really tried this, but
EDIT: Actually, rereading part I was answering, you want 123ª123 to turn into 12SUBST. Your sed expression should do this (and does on my machine), sed is `greedy' and will match with the longest expression it can. If it's not doing that you can explicitly state it as `3.{0,}' (remember to escape the braces \{\}) which is exactly what the `*' operator performs.
It looks like you did, so next question is, which sed are you using? I can't replicate this behaviour with gnused-4.2
My bad, didn't completely correct myself the first time around. But yes, I did try it, and that is the desired output from sed, so either there is a bug (highly unlikely) or something else, such as the locale, is interfering with its ability to interpret single bytes.
Sorry, I should answer the actual question, shouldn't I?

[\x00-\xFF] is along the right track. However it just includes every possible byte value. I'm assuming from your description that you want to find anything that isn't a regular character.
Here we go again: does this [\x00-\xFF] actually work for you?

Code: Select all

sed 's/[\x00-\xFF]*//' <<< ''
sed: -e expression #1, char 16: Invalid collation character

Anyway, since we were talking about non-printable characters, I tried this

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed  's/[^[:print:]]/SUBS/'
123�123
But as you can see, it also didn't work :S


I've found that bbe is probably the tool you'll need, unless we find something new with sed :)

Code: Select all

jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]*#SUBST#'
SUBST
If you're looking on a byte-by-byte basis, 0x00-0xFF covers the entire possible range of bytes. And as you can see, substitutes quite nicely.

I'm using GNU sed 4.1.5, and as you can see, the original command produces the desired output:

Code: Select all

jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
12SUBST
If \xaa is not actually a valid character in UTF-8 encoding, that may explain the odd behaviour of sed.

Post by truc » Wed Jun 10, 2009 8:00 am

jw5801 wrote:I'm using GNU sed 4.1.5, and as you can see, the original command produces the desired output:

Code: Select all

jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
12SUBST
If \xaa is not actually a valid character in UTF-8 encoding, that may explain the odd behaviour of sed.
Wow, this is weird!

Code: Select all

locale
:?:

Post by jw5801 » Wed Jun 10, 2009 8:02 am

Akkara wrote:
tbart wrote:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33"
It might have to do with your locale. If you are using a utf-8 encoding, '\xaa' is an invalid character. The first byte of a utf-8 character is never of the form '10xxxxxx'.

Sed reads full UTF-8 characters, which may span anywhere from 1 to 4 bytes, and treats those bytes as a unit before trying to match things. To get a byte-by-byte substitution, it might work to use a locale that only has 8-bit characters (but I don't know; I didn't try it).

Try this:

Code: Select all

echo -e "\x31\x32\x33\xc3\xb7\x31\x32\x33" | sed 's:3\(.\):3<\1>:'
The two bytes, '\xc3\xb7' represent the division character in utf-8, which sed treats as the single character it is, and matches it to '.'
Doesn't appear as a single character for me, but as you state, differing encodings treating bytes differently would explain that.

Code: Select all

jw@Andornor ~ $ echo -e "\x31\x32\x33\xc3\xb7\x31\x32\x33"
123÷123
jw@Andornor ~ $ echo -e "\x31\x32\x33\xc3\xb7\x31\x32\x33" | sed 's:3\(.\):3<\1>:'
123<Ã>·123
sed performs as I would expect it to here.

Post by jw5801 » Wed Jun 10, 2009 8:10 am

truc wrote:
jw5801 wrote:I'm using GNU sed 4.1.5, and as you can see, the original command produces the desired output:

Code: Select all

jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
12SUBST
If \xaa is not actually a valid character in UTF-8 encoding, that may explain the odd behaviour of sed.
Wow, this is weird!

Code: Select all

locale
:?:
I'm using en_AU ISO8859-1.

Post by tbart » Wed Jun 10, 2009 10:15 am

Ladies and Gentlemen,

I am confused.

WTF is going on here? Why do the same sed versions on different machines perform differently? Locales you say... Right.
I tried without UTF-8 now, because my box indeed is UTF-8 only.

Code: Select all

tbart@black_knight ~ $ LC_ALL=en_US.ISO-8859-1 LANG=en_US.ISO-8859-1 echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
12SUBST<unprintable char>123
We see mixed problems here I guess.
This:
jw5801 wrote:I'm using GNU sed 4.1.5, and as you can see, the original command produces the desired output:

Code: Select all

jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#' 
12SUBST
might be because your terminal emulator is not capable of printing the unprintable character. You could try to rule that out..
This might also explain why the division sign example does not work for you. Your terminal does not seem to be UTF-8 capable which additionally seems to be a different thing than the locale itself.
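
One way to rule the terminal in or out is to look at the bytes directly. A small sketch, assuming od and tr are available; to_hex is a made-up helper name, and printf with the octal escape \252 (0xaa) stands in for echo -e so that it does not depend on bash:

```shell
# Hypothetical helper: print stdin as one continuous string of hex digits.
to_hex() {
  od -An -tx1 | tr -d ' \n'
  echo
}

# The raw test string: 0xaa sits between the two "123" groups.
printf '123\252123\n' | to_hex

# The same check on sed's output; this result depends on the current locale.
printf '123\252123\n' | sed 's#3.*#SUBST#' | to_hex
```

The first line should read `313233aa3132330a`. If the second line still contains `aa`, sed really left the byte in place and the terminal is not to blame.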

I am mostly with truc now. I have the same results as he has, same version of sed.

jw5801:
Did you try

Code: Select all

sed -e 's#3[\x00-\xFF]+#SUBST#'
?
Me (and truc) only get

Code: Select all

sed: -e expression #1, char 22: Invalid collation character
so this does not seem to be the way to go.

Just for further clarification:
I am operating on binary files (DV-DIF in AVI, that is) so I really have to match \x00-\xFF, so character classes like [:print:] (or the negation of it) won't do.
Precisely I want to match
\x60 <any four bytes> \x61 <any four bytes> \x62 \xFF \x1F \xFF \xFF \x63 <any four bytes>
and substitute the \x62 part with something different.

(The initiated might note: I am trying to change the DV-DIF's record date; I could (and that's the way I do it now) only search and replace the \x62 part, but I want to match as much as possible so I don't accidentally change the same sequence in actual video data; I know DV-DIF has a specified fixed structure and it screams for a C program but hey, where's the fun? No, seriously, I already have windoze dll code for doing this, but it's not my code, it's a dozen files and I am simply not that much into C programming as I am into bash scripting ;->)

I will take a look at bbe, but I'd like to use standard tools... Also because this might have to be portable to win32/cygwin which means any additional application is a hurdle more to take.

Where are the sed masters out there? (I initially thought I am one...)

Thanks for your great input!
th

Post by jw5801 » Wed Jun 10, 2009 11:02 am

tbart wrote:Ladies and Gentlemen,

I am confused.

WTF is going on here? Why do the same sed versions on different machines perform differently? Locales you say... Right.
I tried without UTF-8 now, because my box indeed is UTF-8 only.

Code: Select all

tbart@black_knight ~ $ LC_ALL=en_US.ISO-8859-1 LANG=en_US.ISO-8859-1 echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
12SUBST<unprintable char>123
We see mixed problems here I guess.
This:
jw5801 wrote:I'm using GNU sed 4.1.5, and as you can see, the original command produces the desired output:

Code: Select all

jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#' 
12SUBST
might be because your terminal emulator is not capable of printing the unprintable character. You could try to rule that out..
This might also explain why the division sign example does not work for you. Your terminal does not seem to be UTF-8 capable which additionally seems to be a different thing than the locale itself.
It is not because the terminal cannot print the character -- it's a perfectly valid character in the ISO 8859 encoding, and I can see it with no problems straight from the echo statement. Your output is not the expected output, quite probably because the variable assignments in front of echo apply only to echo and do not carry across the pipe (|) to sed; that would be my guess at an explanation anyway.
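
The scoping is easy to check in isolation. In the sketch below, DEMO_VAR is a made-up variable name: an assignment written in front of the first command of a pipeline is visible to that command only, so a locale meant for sed has to be written directly in front of sed.

```shell
unset DEMO_VAR

# The assignment prefixes only the first command; the right-hand side never sees it.
seen_right=$(DEMO_VAR=set true | sh -c 'echo "${DEMO_VAR:-unset}"')
echo "right side of the pipe sees: $seen_right"

# Written in front of the second command, the assignment does reach it.
seen_fixed=$(true | DEMO_VAR=set sh -c 'echo "${DEMO_VAR:-unset}"')
echo "with the prefix moved:       $seen_fixed"
```

Applied to the command in question, that would mean `echo -e ... | LC_ALL=en_US.ISO-8859-1 sed ...`, provided that locale is actually installed on the machine.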
I am mostly with truc now. I have the same results as he has, same version of sed.

jw5801:
Did you try

Code: Select all

sed -e 's#3[\x00-\xFF]+#SUBST#'
?
Me (and truc) only get

Code: Select all

sed: -e expression #1, char 22: Invalid collation character
so this does not seem to be the way to go.
Once again, that one is my bad. The `+' delimiter is perfectly valid in most regular expressions, but for some reason it would appear that sed doesn't handle it. It's just a quicker way of writing `\{1,\}' anyhow, so try that. That would be why the more recent version of sed is erroring (my version simply ignores it and doesn't give correct output). My suggestion would be to try:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]\{1,\}#SUBST#'
Which should be the same thing, but isn't. As (at least in ISO 8859 encoding) one character corresponds to one byte (two hex numbers), [\x00-\xff] covers the entire range of possible characters (whether they are printable, readable or whatever), and the repetition created by \{1,\} means it will match one or more of these characters, therefore matching the entire line, so the output you get should just be `SUBST'. In UTF-8, each character is not necessarily one byte long, so this assumption does not hold and you will get garbled output, as we have been observing.
Just for further clarification:
I am operating on binary files (DV-DIF in AVI, that is) so I really have to match \x00-\xFF, so character classes like [:print:] (or the negation of it) won't do.
Precisely I want to match
\x60 <any four bytes> \x61 <any four bytes> \x62 \xFF \x1F \xFF \xFF \x63 <any four bytes>
and substitute the \x62 part with something different.

(The initiated might note: I am trying to change the DV-DIF's record date; I could (and that's the way I do it now) only search and replace the \x62 part, but I want to match as much as possible so I don't accidentally change the same sequence in actual video data; I know DV-DIF has a specified fixed structure and it screams for a C program but hey, where's the fun? No, seriously, I already have windoze dll code for doing this, but it's not my code, it's a dozen files and I am simply not that much into C programming as I am into bash scripting ;->)

I will take a look at bbe, but I'd like to use standard tools... Also because this might have to be portable to win32/cygwin which means any additional application is a hurdle more to take.

Where are the sed masters out there? (I initially thought I am one...)

Thanks for your great input!
th
As stated earlier, one UTF-8 character is anywhere from 1 to 4 bytes long, so you really don't want to be doing this in UTF-8. I'd recommend setting your locale in the script you're using to run this and see if that helps. The sed expression I'd use for what you're after is:

Code: Select all

s:\(\x60[\x00-\xFF]\{4\}\x61[\x00-\xFF]\{4\}\)\x62\(\xFF\x1F\xFF\xFF\x63[\x00-\xFF]\{4\}\):\1YOUR_SUBSTITUTION\2:
Which I can demonstrate as working on my machine for the format you describe with the following output:

Code: Select all

jw@Andornor ~ $ echo -e "\x60\x31\x32\x33\x34\x61\x41\x42\x43\x44\x62\xFF\x1F\xFF\xFF\x63\x51\x52\x53\x54"                                           
`1234aABCDbÿÿÿcQRST
jw@Andornor ~ $ echo -e "\x60\x31\x32\x33\x34\x61\x41\x42\x43\x44\x62\xFF\x1F\xFF\xFF\x63\x51\x52\x53\x54" | sed 's:\(\x60[\x00-\xFF]\{4\}\x61[\x00-\xFF]\{4\}\)\x62\(\xFF\x1F\xFF\xFF\x63[\x00-\xFF]\{4\}\):\1YOUR_SUBSTITUTION\2:'
`1234aABCDYOUR_SUBSTITUTIONÿÿÿcQRST
`\x62' is `b', so you can observe the change in that. `\x1F' is a non-printing control character (the printable characters start at `\x20'), so that's why one appears to be missing.
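
For reference, the same substitution can be sketched so that it does not depend on the login locale. This is only a sketch under assumptions: GNU sed (for the \xHH escapes), LC_ALL=C set on the sed command so that '.' stands for any single byte, and a single X as a stand-in replacement. Because sed works line by line, a 0x0a byte inside binary data ends the line and will never be matched by '.' or by a bracket range.

```shell
# Test record from the post above, written with octal escapes for portability:
# 0x60 "1234" 0x61 "ABCD" 0x62 0xff 0x1f 0xff 0xff 0x63 "QRST"
record() {
  printf '\1401234aABCDb\377\037\377\377cQRST\n'
}

# In the C locale "." matches one byte, so ".\{4\}" is "any four bytes".
# X is a placeholder for the real replacement bytes.
patched_hex=$(record \
  | LC_ALL=C sed 's:\(\x60.\{4\}\x61.\{4\}\)\x62\(\xFF\x1F\xFF\xFF\x63.\{4\}\):\1X\2:' \
  | od -An -tx1 | tr -d ' \n')

echo "$patched_hex"
```

Reading the hex dump, the only expected difference from the input is 58 (X) where 62 (b) used to be; the 1f byte is still there even though a terminal shows nothing for it.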

Post by truc » Wed Jun 10, 2009 12:03 pm

jw5801 wrote:Once again, that one is my bad. The `+' delimiter is perfectly valid in most regular expressions, but for some reason it would appear that sed doesn't handle it. It's just a quicker way of writing `\{1,\}'
You're mixing up basic regular expressions (BRE) and extended regular expressions (ERE); see the -r switch for sed in the manual.
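
A short sketch of the difference, assuming a sed that accepts -E for extended regular expressions (GNU sed also spells it -r): in basic regular expressions a bare + is an ordinary character, and the interval has to be written \{1,\}.

```shell
# BRE: "+" is a literal plus sign, so nothing in "aaa" matches.
bre_plus=$(echo 'aaa' | sed 's/a+/X/')

# BRE: the escaped interval means "one or more".
bre_interval=$(echo 'aaa' | sed 's/a\{1,\}/X/')

# ERE: "+" is the one-or-more quantifier.
ere_plus=$(echo 'aaa' | sed -E 's/a+/X/')

echo "$bre_plus $bre_interval $ere_plus"
```

This should print `aaa X X`: the first command silently matches nothing, which is why an unescaped + in a default sed expression looks like it is being ignored.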

Post by tbart » Wed Jun 10, 2009 12:09 pm

Thanks for hanging with me...

OK, so your terminal is not at fault, my bad, sorry.
However, the + vs. {1,} problem does not seem to be the real problem itself, both have to work I guess. At least on your machine they both do not produce an error like here and at truc's box.

The pipe definitely breaks the env variable set before, you're right.
So I defined the locale globally.
But see for yourself:

Code: Select all

tbart@black_knight ~ $ export LC_ALL=en_US.ISO-8859-1
tbart@black_knight ~ $ export LANG=en_US.ISO-8859-1
tbart@black_knight ~ $ locale
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_PAPER="en_US.ISO-8859-1"
LC_NAME="en_US.ISO-8859-1"
LC_ADDRESS="en_US.ISO-8859-1"
LC_TELEPHONE="en_US.ISO-8859-1"
LC_MEASUREMENT="en_US.ISO-8859-1"
LC_IDENTIFICATION="en_US.ISO-8859-1"
LC_ALL=en_US.ISO-8859-1
tbart@black_knight ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]\{1,\}#SUBST#'
sed: -e expression #1, char 26: Invalid collation character
tbart@black_knight ~ $ sed --version
GNU sed version 4.1.5
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE,
to the extent permitted by law.
And thanks for the sed expression, I also have that. But it simply does not work as you can see above..
Any more hints as to why this works at your box and not at ours?

I saw just now that setting LC_ALL and LANG does not seem to be enough, as set still returns UTF-8 for the rest although locale says ISO8859-1 as noted above, interesting.
Nonetheless,

Code: Select all

tbart@black_knight ~ $ for i in CTYPE MESSAGES MONETARY NUMERIC COLLATE TIME ALL; do export LC_$i=en_US.ISO-8859-1; done
also does not help (set reports all LC_* variables as ISO8859-1 now...)

any hints?

Post by jw5801 » Wed Jun 10, 2009 2:16 pm

truc wrote:
jw5801 wrote:Once again, that one is my bad. The `+' delimiter is perfectly valid in most regular expressions, but for some reason it would appear that sed doesn't handle it. It's just a quicker way of writing `\{1,\}'
You're mixing up basic regular expressions (BRE) and extended regular expressions (ERE); see the -r switch for sed in the manual.
Ah, right. Thanks for that.
tbart wrote:Thanks for hanging with me...

OK, so your terminal is not at fault, my bad, sorry.
However, the + vs. {1,} problem does not seem to be the real problem itself, both have to work I guess. At least on your machine they both do not produce an error like here and at truc's box.

The pipe definitely breaks the env variable set before, you're right.
So I defined the locale globally.
But see for yourself:

Code: Select all

tbart@black_knight ~ $ export LC_ALL=en_US.ISO-8859-1
tbart@black_knight ~ $ export LANG=en_US.ISO-8859-1
tbart@black_knight ~ $ locale
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_PAPER="en_US.ISO-8859-1"
LC_NAME="en_US.ISO-8859-1"
LC_ADDRESS="en_US.ISO-8859-1"
LC_TELEPHONE="en_US.ISO-8859-1"
LC_MEASUREMENT="en_US.ISO-8859-1"
LC_IDENTIFICATION="en_US.ISO-8859-1"
LC_ALL=en_US.ISO-8859-1
tbart@black_knight ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]\{1,\}#SUBST#'
sed: -e expression #1, char 26: Invalid collation character
tbart@black_knight ~ $ sed --version
GNU sed version 4.1.5
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE,
to the extent permitted by law.
And thanks for the sed expression, I also have that. But it simply does not work as you can see above..
Any more hints as to why this works at your box and not at ours?

I saw just now that setting LC_ALL and LANG does not seem to be enough, as set still returns UTF-8 for the rest although locale says ISO8859-1 as noted above, interesting.
Nonetheless,

Code: Select all

tbart@black_knight ~ $ for i in CTYPE MESSAGES MONETARY NUMERIC COLLATE TIME ALL; do export LC_$i=en_US.ISO-8859-1; done
also does not help (set reports all LC_* variables as ISO8859-1 now...)

any hints?
I'm at a bit of a loss as to why this is happening! I can reproduce the error, but I can't reverse it short of creating a new login shell.

Actually, scrap that. Further interesting developments!

Code: Select all

jw@Andornor ~ $ export LANG=en_AU.iso88591
jw@Andornor ~ $ for i in CTYPE MESSAGES MONETARY NUMERIC COLLATE TIME ALL; do export LC_$i=en_AU.iso88591; done
jw@Andornor ~ $ locale
LANG=en_AU.iso88591
LC_CTYPE="en_AU.iso88591"
LC_NUMERIC="en_AU.iso88591"
LC_TIME="en_AU.iso88591"
LC_COLLATE="en_AU.iso88591"
LC_MONETARY="en_AU.iso88591"
LC_MESSAGES="en_AU.iso88591"
LC_PAPER="en_AU.iso88591"
LC_NAME="en_AU.iso88591"
LC_ADDRESS="en_AU.iso88591"
LC_TELEPHONE="en_AU.iso88591"
LC_MEASUREMENT="en_AU.iso88591"
LC_IDENTIFICATION="en_AU.iso88591"
LC_ALL=en_AU.iso88591
jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]\{1,\}#SUBST#'
sed: -e expression #1, char 26: Invalid collation character
Exporting to what I believe my locale should be, produces the error. However, unsetting all locale variables returns it to a working state.

Code: Select all

jw@Andornor ~ $ unset LANG
jw@Andornor ~ $ for i in CTYPE MESSAGES MONETARY NUMERIC COLLATE TIME ALL; do unset LC_$i; done
jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]\{1,\}#SUBST#'
SUBST
Perhaps give that a try? `locale' tells me that everything is set to POSIX if I do that.

Post by tbart » Wed Jun 10, 2009 3:36 pm

OK, POSIX works!

LC_ALL and LC_COLLATE have to be set to POSIX, then it seems my results are the same as yours.
Thanks a lot!

However: Can anyone pls explain why your ISO-8859-1 also works (at least when not setting it manually) and mine does not?

Guess I'll set everything to POSIX in my script, just to be on the safe side...
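
A minimal sketch of what pinning the locale inside a script could look like. It assumes GNU sed for the \xHH escapes. C and POSIX name the same locale, and LC_ALL takes precedence over LANG and the individual LC_* variables, so a single export covers all of them.

```shell
#!/bin/sh
# Pin the locale for everything this script runs: one byte is one character.
LC_ALL=C
export LC_ALL

# With byte semantics, a bracket range over the high half matches any single
# byte from 0x80 to 0xff, whatever encoding the data was meant to be in.
result=$(printf '123\252123\n' | sed 's#3[\x80-\xff]#SUBST#')

echo "$result"
```

With the locale pinned this way the output should be `12SUBST123` regardless of what the login environment uses.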

Post by truc » Wed Jun 10, 2009 5:11 pm

weird, this doesn't work for me, even after unsetting LANG and exporting LC_ALL&LC_COLLATE (='POSIX')


Anyway, usually one doesn't use sed to edit files. The -i switch is a GNU extension, but that's not the problem here (AFAIK it does something like sed 'cmd' bigfile > bigfile_result, then mv bigfile_result bigfile, and that's not really in-place editing). Since you seem to like sed, you should definitely look into ed, the text editor. It's really nice and, hmm... fun! ;)
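
The temp-file-and-rename sequence described here can be written out by hand. A sketch, assuming mktemp is available; the file names and contents are invented for the demonstration:

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
printf 'old value\n' > "$workdir/bigfile"

# Write the edited copy next to the original, then rename it over the original.
sed 's/old/new/' "$workdir/bigfile" > "$workdir/bigfile_result" \
  && mv "$workdir/bigfile_result" "$workdir/bigfile"

edited=$(cat "$workdir/bigfile")
echo "$edited"
rm -rf "$workdir"
```

The && matters: if sed fails, the original file is left untouched instead of being replaced by a partial result.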

Don't forget to tag your thread as [SOLVED] if you think it is.
Last edited by truc on Mon Jun 29, 2009 4:02 pm, edited 1 time in total.

Post by tbart » Mon Jun 29, 2009 11:19 am

edi with an i?
I only know ed, the non-visual vi and yes, it's fun.
I sometimes use ed with here-scripts in my bash apps but for simple substitutions, sed is more comfortable I think..

Thanks for the tip however, I didn't actually think of ed in this situation.

However, I don't understand why setting the locale to POSIX does not work for you... This might still be something I have to watch out for, as my script should work on different systems..

Post by truc » Mon Jun 29, 2009 4:19 pm

tbart wrote:edi with an i?
Oops, I meant ed
However, I don't understand why setting the locale to POSIX does not work for you... This might still be something I have to watch out for, as my script should work on different systems..
This may be of some help:

Code: Select all

$ POSIXLY_CORRECT=oeu sed -r 's/[\x00-\xff]//' <<< '' 
sed: -e expression #1, char 15: Invalid range end
$ POSIXLY_CORRECT=oeu LC_ALL="POSIX" sed -r 's/[\x00-\xff]//' <<< ''
But this still doesn't work for me here:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | LC_ALL="POSIX" POSIXLY_CORRECT=oeu sed -e 's#[\x00-\xFF]#SUBST#'
SUBST23�123
and

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | LC_ALL="POSIX" POSIXLY_CORRECT=oeu sed -e 's#3[\x00-\xFF]#SUBST#'
123�123

Post by tbart » Tue Jun 30, 2009 2:16 pm

Ok, I think we're diving deep into *nix locale hell (or is it heaven?).

First, I do not really understand what POSIXLY_CORRECT=* does and if it really has something to do with character width (I doubt it).

Second, you might try to really set everything to POSIX. If that does not work, put it in a script and try again. I think I also could not make it behave correctly on the commandline. Possibly you need the variables set for *both* commands (the echo and the sed part), else there might be some char width mismatching problem again... (only a shot in the dark...)

Well, just tested it, at least here this is not the case.

The commands you mentioned produce the same output here as well

Maybe the following works for you?

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | LC_ALL="POSIX" sed -e 's#3[\x00-\xFF]#SUBST#'
12SUBST123
It does here...

Post by jw5801 » Tue Jun 30, 2009 2:31 pm

tbart wrote:Ok, I think we're diving deep into *nix locale hell (or is it heaven?).

...

The commands you mentioned produce the same output here as well

Maybe the following works for you?

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | LC_ALL="POSIX" sed -e 's#3[\x00-\xFF]#SUBST#'
12SUBST123
It does here...
Yeah, the `POSIXLY_CORRECT' bit screws it up on my system as well. Not sure what it's meant to achieve? What happens without it?