[SOLVED] sed matching any ascii code/hex byte

tbart · Post by **tbart** » Tue Jun 09, 2009 4:05 pm

Now this is a tough nut to crack...

Sed is perfect. I love it more every day. It can even operate on binary files! I use it to substitute arbitrary byte sequences with others.. but:

Try this:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33"

easy, we output

Code: Select all

123<some unprintable character>123

suppose we want to substitute that character. easy:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#\xaa#SUBST#'

outputs

Code: Select all

123SUBST123

OK. But how do you substitute, say, 3<the unprintable character>123 with SUBST if you do not know it's \xaa? (It could be any unprintable char...)

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'

should do it, you say. Nope. This outputs

Code: Select all

12SUBST<the unprintable char>123

So it seems that . only matches printable characters, contrary to what the man page says (any character).

Here goes the question:
What is the correct way to address "any byte value" or "any printable or non-printable value from 0x00 to 0xFF"?

(no, [\x00-\xff] also does not work..)

This has to be possible with sed, it can match any byte value as shown above with \xaa. But how do you address ALL of them?

Thanks a lot in advance!

th

jw5801 · Post by **jw5801** » Wed Jun 10, 2009 3:47 am

The problem is your regexp. '.*' matches with any number of characters, including zero! So since sed will match the first result it finds, it matches with the 3 by itself, not including anything after it.

Instead, try this:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.\{1,\}#SUBST#'

'{1}' means exactly one repetition of whatever comes before it. The braces are escaped because sed uses basic regular expressions by default, where the interval operator is written \{ \}; inside single quotes bash leaves them alone either way. Similarly, {1,3} means 1 to 3 repetitions of what comes before. It doesn't have to be bounded either: {1,} matches one or more and {,3} matches any number up to 3. An example of this is the regexp `[0-9]{1,3}', which will match any decimal number 1 to 3 digits long. Therefore '.{1}' translates to any one character, so `3.\{1\}' would match the 3ª and produce `12SUBST123', while `3.\{1,\}' is greedy and takes everything from the 3 to the end of the line.

An excellent regexp resource can be found here.

EDIT: Actually, rereading part I was answering, you want 123ª123 to turn into 12SUBST. Your sed expression should do this (and does on my machine), sed is `greedy' and will match with the longest expression it can. If it's not doing that you can explicitly state it as `3.{0,}' (remember to escape the braces \{\}) which is exactly what the `*' operator performs.
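To keep the regex question apart from the locale question, the greedy behaviour can be checked with plain ASCII input, where the encoding should not matter. This is only a small sketch; the sample string `a3b3c' and the function names are made up for illustration.

```shell
# '.*' is greedy: starting at the first 3 it runs to the end of the line.
greedy_demo() {
    printf '%s\n' 'a3b3c' | sed 's#3.*#SUBST#'
}

# '.\{1,\}' needs at least one character after the 3, but is just as greedy.
interval_demo() {
    printf '%s\n' 'a3b3c' | sed 's#3.\{1,\}#SUBST#'
}

greedy_demo      # prints aSUBST
interval_demo    # prints aSUBST
```

If both print aSUBST, the regexp itself is fine and any different result on the \xaa input has to come from somewhere else.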

jw5801 · Post by **jw5801** » Wed Jun 10, 2009 4:33 am

Sorry, I should answer the actual question, shouldn't I?

[\x00-\xFF] is along the right track. However it just includes every possible byte value. I'm assuming from your description that you want to find anything that isn't a regular character. The regular characters finish up at 0x7F, so if you want to match anything after that (extended ASCII), then use the regexp: `[\x80-\xff]+'. The `+' matches with one or more of the previous token, performing the same operation as `[\x80-\xff]{1,}' (don't forget to escape the braces if you're using them though: \{\}).

truc · Post by **truc** » Wed Jun 10, 2009 5:54 am

jw5801 wrote:The problem is your regexp. '.*' matches with any number of characters, including zero!

Yeah, but as you said in your EDIT, sed is `greedy', so the longest match for this regex runs from the first 3 to the end of the line.

jw5801 wrote:So since sed will match the first result it finds, it matches with the 3 by itself, not including anything after it.

So this should be wrong...

I was about to ask if you really tried this, but

EDIT: Actually, rereading part I was answering, you want 123ª123 to turn into 12SUBST. Your sed expression should do this (and does on my machine), sed is `greedy' and will match with the longest expression it can. If it's not doing that you can explicitly state it as `3.{0,}' (remember to escape the braces \{\}) which is exactly what the `*' operator performs.

It looks like you did, so next question is, which sed are you using? I can't replicate this behaviour with gnused-4.2

Sorry, I should answer the actual question, shouldn't I?

[\x00-\xFF] is along the right track. However it just includes every possible byte value. I'm assuming from your description that you want to find anything that isn't a regular character.

Here we go again, does this [\x00-\xFF] actually work for you?

Code: Select all

sed 's/[\x00-\xFF]*//' <<< ''
sed: -e expression #1, char 16: Invalid collation character

Anyway, since we were talking about non-printable characters, I tried this

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed  's/[^[:print:]]/SUBS/'
123�123

But as you can see, it also didn't work :S

I've found that bbe is probably the tool you'll need, unless we find something new with sed

Akkara · Post by **Akkara** » Wed Jun 10, 2009 6:40 am

tbart wrote:
Code: Select all
echo -e "\x31\x32\x33\xaa\x31\x32\x33"

It might have to do with your locale. If you are using a utf-8 encoding, '\xaa' is an invalid character. The first byte of a utf-8 character is never of the form '10xxxxxx'.

Sed reads full UTF-8 characters, which may span anywhere from 1 to 4 bytes, and treats those bytes as a unit before trying to match things. To get a byte-by-byte substitution, it might work to use a locale that only has 8-bit characters (but I don't know; I didn't try it).

Try this:

Code: Select all

echo -e "\x31\x32\x33\xc3\xb7\x31\x32\x33" | sed 's:3\(.\):3<\1>:'

The two bytes, '\xc3\xb7' represent the division character in utf-8, which sed treats as the single character it is, and matches it to '.'
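The `10xxxxxx' rule can be checked with a bit of shell arithmetic. A rough sketch follows; utf8_byte_kind is a hypothetical helper name, and it only looks at the top two bits, so it does not validate complete sequences (some bytes it labels "lead" are still never valid in UTF-8).

```shell
# Classify a single byte value by its top bits, following the UTF-8 layout.
# utf8_byte_kind is a made-up helper for illustration, not a standard tool.
utf8_byte_kind() {
    b=$(( $1 ))
    if   [ $(( b & 0x80 )) -eq 0 ];   then echo "ascii"          # 0xxxxxxx
    elif [ $(( b & 0xC0 )) -eq 128 ]; then echo "continuation"   # 10xxxxxx
    else                                   echo "lead"           # 11xxxxxx
    fi
}

utf8_byte_kind 0xaa   # prints continuation: cannot start a character
utf8_byte_kind 0xc3   # prints lead: first byte of the sequence c3 b7
utf8_byte_kind 0xb7   # prints continuation
utf8_byte_kind 0x33   # prints ascii (the digit 3)
```

So a lone \xaa is a continuation byte with nothing in front of it, which is why a UTF-8 locale cannot treat it as a character on its own.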

jw5801 · Post by **jw5801** » Wed Jun 10, 2009 7:51 am

truc wrote:
jw5801 wrote:The problem is your regexp. '.*' matches with any number of characters, including zero!
Yeah but as you said in your EDIT
sed is `greedy'
, and so the longest expression matching this regex is from the first 3 to the end of the line.
So since sed will match the first result it finds, it matches with the 3 by itself, not including anything after it.
so this should be wrong
...
I was about to ask if you really tried this, but

EDIT: Actually, rereading part I was answering, you want 123ª123 to turn into 12SUBST. Your sed expression should do this (and does on my machine), sed is `greedy' and will match with the longest expression it can. If it's not doing that you can explicitly state it as `3.{0,}' (remember to escape the braces \{\}) which is exactly what the `*' operator performs.
It looks like you did, so next question is, which sed are you using? I can't replicate this behaviour with gnused-4.2

My bad, didn't completely correct myself the first time around. But yes, I did try it, and that is the desired output from sed, so either there is a bug (highly unlikely) or something else, such as the locale, is interfering with its ability to interpret single bytes.

Sorry, I should answer the actual question, shouldn't I?

[\x00-\xFF] is along the right track. However it just includes every possible byte value. I'm assuming from your description that you want to find anything that isn't a regular character.
Here we go again, does this [\x00-\xFF] actually work for you?
Code: Select all
sed 's/[\x00-\xFF]*//' <<< ''
sed: -e expression #1, char 16: Invalid collation character
Anyway, since we were talking about non-printable characters, I tried this
Code: Select all
echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed  's/[^[:print:]]/SUBS/'
123�123
But as you can see, it also didn't work :S

I've found that bbe is probably the tool you'll need, unless we find something new with sed

Code: Select all

jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]*#SUBST#'
SUBST

If you're looking on a byte-by-byte basis, 0x00-0xFF covers the entire possible range of bytes. And as you can see, substitutes quite nicely.

I'm using GNU sed 4.1.5, and as you can see, the original command produces the desired output:

Code: Select all

jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
12SUBST

If \xaa is not actually a valid character in UTF-8 encoding, that may explain the odd behaviour of sed.

truc · Post by **truc** » Wed Jun 10, 2009 8:00 am

jw5801 wrote:I'm using GNU sed 4.1.5, and as you can see, the original command produces the desired output:
Code: Select all
jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
12SUBST
If \xaa is not actually a valid character in UTF-8 encoding, that may explain the odd behaviour of sed.

Wow, this is weird!

Code: Select all

locale

jw5801 · Post by **jw5801** » Wed Jun 10, 2009 8:02 am

Akkara wrote:
tbart wrote:
Code: Select all
echo -e "\x31\x32\x33\xaa\x31\x32\x33"
It might have to do with your locale. If you are using a utf-8 encoding, '\xaa' is an invalid character. The first byte of a utf-8 character is never of the form '10xxxxxx'.

Sed reads full UTF-8 characters, which may span anywhere from 1 to 4 bytes, and treats those bytes as a unit before trying to match things. To get a byte-by-byte substitution, it might work to use a locale that only has 8-bit characters (but I don't know; I didn't try it).

Try this:
Code: Select all
echo -e "\x31\x32\x33\xc3\xb7\x31\x32\x33" | sed 's:3\(.\):3<\1>:'
The two bytes, '\xc3\xb7' represent the division character in utf-8, which sed treats as the single character it is, and matches it to '.'

Doesn't appear as a single character for me, but as you state, differing encodings treating bytes differently would explain that.

Code: Select all

jw@Andornor ~ $ echo -e "\x31\x32\x33\xc3\xb7\x31\x32\x33"
123Ã·123
jw@Andornor ~ $ echo -e "\x31\x32\x33\xc3\xb7\x31\x32\x33" | sed 's:3\(.\):3<\1>:'
123<Ã>·123

sed performs as I would expect it to here.

jw5801 · Post by **jw5801** » Wed Jun 10, 2009 8:10 am

truc wrote:
jw5801 wrote:I'm using GNU sed 4.1.5, and as you can see, the original command produces the desired output:
Code: Select all
jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
12SUBST
If \xaa is not actually a valid character in UTF-8 encoding, that may explain the odd behaviour of sed.
Wow, this is weird!
Code: Select all
locale

I'm using en_AU ISO8859-1.

tbart · Post by **tbart** » Wed Jun 10, 2009 10:15 am

Ladies and Gentlemen,

I am confused.

WTF is going on here? Why do the same sed versions on different machines perform differently? Locales you say... Right.
I tried without UTF-8 now, because my box indeed is UTF-8 only.

Code: Select all

tbart@black_knight ~ $ LC_ALL=en_US.ISO-8859-1 LANG=en_US.ISO-8859-1 echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
12SUBST<unprintable char>123

We see mixed problems here I guess.
This:

jw5801 wrote:I'm using GNU sed 4.1.5, and as you can see, the original command produces the desired output:
Code: Select all
jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#' 
12SUBST

might be because your terminal emulator is not capable of printing the unprintable character. You could try to rule that out..
This might also explain why the division sign example does not work for you. Your terminal does not seem to be UTF-8 capable which additionally seems to be a different thing than the locale itself.

I am mostly with truc now. I have the same results as he has, same version of sed.

jw5801:
Did you try

Code: Select all

sed -e 's#3[\x00-\xFF]+#SUBST#'

?
Me (and truc) only get

Code: Select all

sed: -e expression #1, char 22: Invalid collation character

so this does not seem to be the way to go.

Just for further clarification:
I am operating on binary files (DV-DIF in AVI, that is) so I really have to match \x00-\xFF, so character classes like [:print:] (or the negation of it) won't do.
Precisely I want to match
\x60 <any four bytes> \x61 <any four bytes> \x62 \xFF \x1F \xFF \xFF \x63 <any four bytes>
and substitute the \x62 part with something different.

(The initiated might note: I am trying to change the DV-DIF's record date; I could (and that's the way I do it now) only search and replace the \x62 part, but I want to match as much as possible so I don't accidentally change the same sequence in actual video data; I know DV-DIF has a specified fixed structure and it screams for a C program but hey, where's the fun? No, seriously, I already have windoze dll code for doing this, but it's not my code, it's a dozen files and I am simply not that much into C programming as I am into bash scripting ;->)

I will take a look at bbe, but I'd like to use standard tools... Also because this might have to be portable to win32/cygwin which means any additional application is a hurdle more to take.

Where are the sed masters out there? (I initially thought I am one...)

Thanks for your great input!
th

jw5801 · Post by **jw5801** » Wed Jun 10, 2009 11:02 am

tbart wrote:Ladies and Gentlemen,

I am confused.

WTF is going on here? Why do the same sed versions on different machines perform differently? Locales you say... Right.
I tried without UTF-8 now, because my box indeed is UTF-8 only.
Code: Select all
tbart@black_knight ~ $ LC_ALL=en_US.ISO-8859-1 LANG=en_US.ISO-8859-1 echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#'
12SUBST<unprintable char>123
We see mixed problems here I guess.
This:
jw5801 wrote:I'm using GNU sed 4.1.5, and as you can see, the original command produces the desired output:
Code: Select all
jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#3.*#SUBST#' 
12SUBST
might be because your terminal emulator is not capable of printing the unprintable character. You could try to rule that out..
This might also explain why the division sign example does not work for you. Your terminal does not seem to be UTF-8 capable which additionally seems to be a different thing than the locale itself.

It is not because the terminal cannot print the character: it's a perfectly valid character in the ISO 8859 encoding, and I can see it without problems straight from the echo statement. Your output is not the expected output, quite probably because the environment variables set in front of echo apply only to echo and do not carry across the pipe (|) to sed. That would be my guess at an explanation, anyway.
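The scope of a VAR=value prefix is easy to check in isolation. A minimal sketch, where DEMO_VAR and the two function names are invented purely for the demonstration:

```shell
# A VAR=value prefix applies only to the command it is attached to.
# DEMO_VAR is a made-up variable name used purely for illustration.
unset DEMO_VAR

left_only() {
    # The prefix is on the left command, so the right one does not see it.
    DEMO_VAR=set sh -c 'true' | sh -c 'echo "right side sees: ${DEMO_VAR:-nothing}"'
}

right_side() {
    # Putting the prefix on the right-hand command is what reaches it.
    true | DEMO_VAR=set sh -c 'echo "right side sees: ${DEMO_VAR:-nothing}"'
}

left_only    # prints: right side sees: nothing
right_side   # prints: right side sees: set
```

Applied to the command in question, the locale assignment would have to sit directly in front of sed rather than in front of echo.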

I am mostly with truc now. I have the same results as he has, same version of sed.

jw5801:
Did you try
Code: Select all
sed -e 's#3[\x00-\xFF]+#SUBST#'
?
Me (and truc) only get
Code: Select all
sed: -e expression #1, char 22: Invalid collation character
so this does not seem to be the way to go.

Once again, that one is my bad. The `+' delimiter is perfectly valid in most regular expressions, but for some reason it would appear that sed doesn't handle it. It's just a quicker way of writing `\{1,\}' anyhow, so try that. That would be why the more recent version of sed is erroring (my version simply ignores it and doesn't give correct output). My suggestion would be to try:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]\{1,\}#SUBST#'

Which should be the same thing, but isn't. As (at least in ISO 8859 encoding) one character corresponds to one byte (two hex digits), [\x00-\xff] covers the entire range of possible characters (whether they are printable, readable or whatever), and the repetition created by \{1,\} means it will match one or more of these characters, therefore matching the entire line, so the output you get should just be `SUBST'. In UTF-8, each character is not necessarily one byte long, so this assumption does not hold and you will get garbled output, as we have been observing.

Just for further clarification:
I am operating on binary files (DV-DIF in AVI, that is) so I really have to match \x00-\xFF, so character classes like [:print:] (or the negation of it) won't do.
Precisely I want to match
\x60 <any four bytes> \x61 <any four bytes> \x62 \xFF \x1F \xFF \xFF \x63 <any four bytes>
and substitute the \x62 part with something different.

(The initiated might note: I am trying to change the DV-DIF's record date; I could (and that's the way I do it now) only search and replace the \x62 part, but I want to match as much as possible so I don't accidentally change the same sequence in actual video data; I know DV-DIF has a specified fixed structure and it screams for a C program but hey, where's the fun? No, seriously, I already have windoze dll code for doing this, but it's not my code, it's a dozen files and I am simply not that much into C programming as I am into bash scripting ;->)

I will take a look at bbe, but I'd like to use standard tools... Also because this might have to be portable to win32/cygwin which means any additional application is a hurdle more to take.

Where are the sed masters out there? (I initially thought I am one...)

Thanks for your great input!
th

As stated earlier, one UTF-8 character is anywhere from 1 to 4 bytes long, so you really don't want to be doing this in UTF-8. I'd recommend setting your locale in the script you're using to run this and see if that helps. The sed expression I'd use for what you're after is:

Code: Select all

s:\(\x60[\x00-\xFF]\{4\}\x61[\x00-\xFF]\{4\}\)\x62\(\xFF\x1F\xFF\xFF\x63[\x00-\xFF]\{4\}\):\1YOUR_SUBSTITUTION\2:

Which I can demonstrate as working on my machine for the format you describe with the following output:

Code: Select all

jw@Andornor ~ $ echo -e "\x60\x31\x32\x33\x34\x61\x41\x42\x43\x44\x62\xFF\x1F\xFF\xFF\x63\x51\x52\x53\x54"                                           
`1234aABCDbÿÿÿcQRST
jw@Andornor ~ $ echo -e "\x60\x31\x32\x33\x34\x61\x41\x42\x43\x44\x62\xFF\x1F\xFF\xFF\x63\x51\x52\x53\x54" | sed 's:\(\x60[\x00-\xFF]\{4\}\x61[\x00-\xFF]\{4\}\)\x62\(\xFF\x1F\xFF\xFF\x63[\x00-\xFF]\{4\}\):\1YOUR_SUBSTITUTION\2:'
`1234aABCDYOUR_SUBSTITUTIONÿÿÿcQRST

`\x62' is `b', so you can observe the change in that. `\x1F' is a non-printing control character (the printable characters start at `\x20'), so that's why one appears to be missing.

truc · Post by **truc** » Wed Jun 10, 2009 12:03 pm

jw5801 wrote:Once again, that one is my bad. The `+' delimiter is perfectly valid in most regular expressions, but for some reason it would appear that sed doesn't handle it. It's just a quicker way of writing `\{1,\}'

You're mixing up Basic Regular Expressions and Extended Regular Expressions; see the -r switch for sed in the manual
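The difference shows up even with plain ASCII input. A small sketch, with made-up function names and sample text, using the -r switch mentioned above (other sed implementations may spell it differently):

```shell
# Basic regex (sed's default): a bare + is an ordinary character,
# so 'a+' finds nothing in this input and the line passes through unchanged.
bre_plus()     { printf '%s\n' 'baaad' | sed 's/a+/X/'; }

# Basic regex: the interval \{1,\} means "one or more".
bre_interval() { printf '%s\n' 'baaad' | sed 's/a\{1,\}/X/'; }

# Extended regex (-r): + means "one or more".
ere_plus()     { printf '%s\n' 'baaad' | sed -r 's/a+/X/'; }

bre_plus       # prints baaad
bre_interval   # prints bXd
ere_plus       # prints bXd
```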

tbart · Post by **tbart** » Wed Jun 10, 2009 12:09 pm

Thanks for hanging with me...

OK, so your terminal is not at fault, my bad, sorry.
However, the + vs. {1,} problem does not seem to be the real problem itself, both have to work I guess. At least on your machine they both do not produce an error like here and at truc's box.

The pipe definitely breaks the env variable set before, you're right.
So I defined the locale globally.
But see for yourself:

Code: Select all

tbart@black_knight ~ $ export LC_ALL=en_US.ISO-8859-1
tbart@black_knight ~ $ export LANG=en_US.ISO-8859-1
tbart@black_knight ~ $ locale
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_PAPER="en_US.ISO-8859-1"
LC_NAME="en_US.ISO-8859-1"
LC_ADDRESS="en_US.ISO-8859-1"
LC_TELEPHONE="en_US.ISO-8859-1"
LC_MEASUREMENT="en_US.ISO-8859-1"
LC_IDENTIFICATION="en_US.ISO-8859-1"
LC_ALL=en_US.ISO-8859-1
tbart@black_knight ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]\{1,\}#SUBST#'
sed: -e expression #1, char 26: Invalid collation character
tbart@black_knight ~ $ sed --version
GNU sed version 4.1.5
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE,
to the extent permitted by law.

And thanks for the sed expression, I also have that. But it simply does not work as you can see above..
Any more hints as to why this works at your box and not at ours?

I saw just now that setting LC_ALL and LANG does not seem to be enough, as set still returns UTF-8 for the rest although locale says ISO8859-1 as noted above, interesting.
Nonetheless,

Code: Select all

tbart@black_knight ~ $ for i in CTYPE MESSAGES MONETARY NUMERIC COLLATE TIME ALL; do export LC_$i=en_US.ISO-8859-1; done

also does not help (set reports all LC_* variables as ISO8859-1 now...)

any hints?

jw5801 · Post by **jw5801** » Wed Jun 10, 2009 2:16 pm

truc wrote:
jw5801 wrote:Once again, that one is my bad. The `+' delimiter is perfectly valid in most regular expressions, but for some reason it would appear that sed doesn't handle it. It's just a quicker way of writing `\{1,\}'
You're mixing up Basic Regular Expressions and Extended Regular Expressions; see the -r switch for sed in the manual

Ah, right. Thanks for that.

tbart wrote:Thanks for hanging with me...

OK, so your terminal is not at fault, my bad, sorry.
However, the + vs. {1,} problem does not seem to be the real problem itself, both have to work I guess. At least on your machine they both do not produce an error like here and at truc's box.

The pipe definitely breaks the env variable set before, you're right.
So I defined the locale globally.
But see for yourself:
Code: Select all
tbart@black_knight ~ $ export LC_ALL=en_US.ISO-8859-1
tbart@black_knight ~ $ export LANG=en_US.ISO-8859-1
tbart@black_knight ~ $ locale
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_PAPER="en_US.ISO-8859-1"
LC_NAME="en_US.ISO-8859-1"
LC_ADDRESS="en_US.ISO-8859-1"
LC_TELEPHONE="en_US.ISO-8859-1"
LC_MEASUREMENT="en_US.ISO-8859-1"
LC_IDENTIFICATION="en_US.ISO-8859-1"
LC_ALL=en_US.ISO-8859-1
tbart@black_knight ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]\{1,\}#SUBST#'
sed: -e expression #1, char 26: Invalid collation character
tbart@black_knight ~ $ sed --version
GNU sed version 4.1.5
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE,
to the extent permitted by law.
And thanks for the sed expression, I also have that. But it simply does not work as you can see above..
Any more hints as to why this works at your box and not at ours?

I saw just now that setting LC_ALL and LANG does not seem to be enough, as set still returns UTF-8 for the rest although locale says ISO8859-1 as noted above, interesting.
Nonetheless,
Code: Select all
tbart@black_knight ~ $ for i in CTYPE MESSAGES MONETARY NUMERIC COLLATE TIME ALL; do export LC_$i=en_US.ISO-8859-1; done
also does not help (set reports all LC_* variables as ISO8859-1 now...)

any hints?

I'm at a bit of a loss as to why this is happening! I can reproduce the error, but I can't reverse it short of creating a new login shell.

Actually, scrap that. Further interesting developments!

Code: Select all

jw@Andornor ~ $ export LANG=en_AU.iso88591
jw@Andornor ~ $ for i in CTYPE MESSAGES MONETARY NUMERIC COLLATE TIME ALL; do export LC_$i=en_AU.iso88591; done
jw@Andornor ~ $ locale
LANG=en_AU.iso88591
LC_CTYPE="en_AU.iso88591"
LC_NUMERIC="en_AU.iso88591"
LC_TIME="en_AU.iso88591"
LC_COLLATE="en_AU.iso88591"
LC_MONETARY="en_AU.iso88591"
LC_MESSAGES="en_AU.iso88591"
LC_PAPER="en_AU.iso88591"
LC_NAME="en_AU.iso88591"
LC_ADDRESS="en_AU.iso88591"
LC_TELEPHONE="en_AU.iso88591"
LC_MEASUREMENT="en_AU.iso88591"
LC_IDENTIFICATION="en_AU.iso88591"
LC_ALL=en_AU.iso88591
jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]\{1,\}#SUBST#'
sed: -e expression #1, char 26: Invalid collation character

Exporting to what I believe my locale should be, produces the error. However, unsetting all locale variables returns it to a working state.

Code: Select all

jw@Andornor ~ $ unset LANG
jw@Andornor ~ $ for i in CTYPE MESSAGES MONETARY NUMERIC COLLATE TIME ALL; do unset LC_$i; done
jw@Andornor ~ $ echo -e "\x31\x32\x33\xaa\x31\x32\x33" | sed -e 's#[\x00-\xff]\{1,\}#SUBST#'
SUBST

Perhaps give that a try? `locale' tells me that everything is set to POSIX if I do that.

tbart · Post by **tbart** » Wed Jun 10, 2009 3:36 pm

OK, POSIX works!

LC_ALL and LC_COLLATE have to be set to POSIX, then it seems my results are the same as yours.
Thanks a lot!

However: Can anyone pls explain why your ISO-8859-1 also works (at least when not setting it manually) and mine does not?

Guess I'll set everything to POSIX in my script, just to be on the safe side...
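For what it's worth, here is roughly how the whole pattern might look with the locale forced only for the sed call. This is a sketch under several assumptions: it uses GNU sed's \xHH escapes, it swaps the [\x00-\xFF] bracket for a plain `.' (which should match any single byte in the C/POSIX locale), the names sample_record and patch_record are hypothetical, and the bytes are invented test data rather than a real DV-DIF block. The sample deliberately contains no newline or NUL bytes inside the record; sed works line by line, so those would need separate treatment.

```shell
# Invented test record: 0x60 + 4 bytes, 0x61 + 4 bytes, 0x62 ff 1f ff ff 0x63 + 4 bytes.
# Octal escapes: \140=0x60 \141=0x61 \142=0x62 \143=0x63 \377=0xFF \037=0x1F \252=0xAA
sample_record() {
    printf '\140\001\200\376\002\141\003\252\004\005\142\377\037\377\377\143\006\007\010\011\n'
}

# Replace only the 0x62 byte, and only when the surrounding bytes match.
# LC_ALL=C is set for this one sed call, so the rest of the script is unaffected.
patch_record() {
    LC_ALL=C sed 's/\(\x60.\{4\}\x61.\{4\}\)\x62\(\xFF\x1F\xFF\xFF\x63.\{4\}\)/\1Z\2/'
}

# Dump the result as hex; the 62 should have become 5a (the letter Z).
sample_record | patch_record | od -An -tx1
```

Setting the variable on the sed command itself avoids having to export anything globally, which seems safer for a script that has to run on different systems.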

truc · Post by **truc** » Wed Jun 10, 2009 5:11 pm

weird, this doesn't work for me, even after unsetting LANG and exporting LC_ALL&LC_COLLATE (='POSIX')

Anyway, usually one doesn't use sed to edit files in place. The -i switch is a GNU extension, but that's not the problem here (AFAIK it does something like sed 'cmd' bigfile > bigfile_result followed by mv bigfile_result bigfile, which isn't really in-place editing). Since you seem to like sed, you should definitely look into ed, the text editor. It's really nice and, hmm... fun!

Don't forget to tag your thread as [SOLVED] if you think it is.

tbart · Post by **tbart** » Mon Jun 29, 2009 11:19 am

edi with an i?
I only know ed, the non-visual vi and yes, it's fun.
I sometimes use ed with here-scripts in my bash apps but for simple substitutions, sed is more comfortable I think..

Thanks for the tip however, I didn't actually think of ed in this situation.

However, I don't understand why setting the locale to POSIX does not work for you... This might still be something I have to watch out for, as my script should work on different systems..

truc · Post by **truc** » Mon Jun 29, 2009 4:19 pm

tbart wrote:edi with an i?

Oops, I meant ed

However, I don't understand why setting the locale to POSIX does not work for you... This might still be something I have to watch out for, as my script should work on different systems..

This may be of some help:

Code: Select all

$ POSIXLY_CORRECT=oeu sed -r 's/[\x00-\xff]//' <<< '' 
sed: -e expression #1, char 15: Invalid range end
$ POSIXLY_CORRECT=oeu LC_ALL="POSIX" sed -r 's/[\x00-\xff]//' <<< ''

But this still doesn't work for me here:

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | LC_ALL="POSIX" POSIXLY_CORRECT=oeu sed -e 's#[\x00-\xFF]#SUBST#'
SUBST23�123

and

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | LC_ALL="POSIX" POSIXLY_CORRECT=oeu sed -e 's#3[\x00-\xFF]#SUBST#'
123�123

tbart · Post by **tbart** » Tue Jun 30, 2009 2:16 pm

Ok, I think we're diving deep into *nix locale hell (or is it heaven?).

First, I do not really understand what POSIXLY_CORRECT=* does and if it really has something to do with character width (I doubt it).

Second, you might try to really set everything to POSIX. If that does not work, put it in a script and try again. I think I also could not make it behave correctly on the commandline. Possibly you need the variables set for *both* commands (the echo and the sed part), else there might be some char width mismatching problem again... (only a shot in the dark...)

Well, just tested it, at least here this is not the case.

The commands you mentioned produce the same output here as well

Maybe the following works for you?

Code: Select all

echo -e "\x31\x32\x33\xaa\x31\x32\x33" | LC_ALL="POSIX" sed -e 's#3[\x00-\xFF]#SUBST#'
12SUBST123

It does here...
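Since half of this thread was about what the terminal does or doesn't display, it may help to look at the bytes themselves. A small sketch, assuming od and tr are available; hex_bytes, before and after are made-up names, and `.' is used here in place of the [\x00-\xFF] bracket:

```shell
# Print stdin as one continuous lowercase hex string, so the terminal's
# rendering of 0xaa does not matter. hex_bytes is a hypothetical helper.
hex_bytes() { od -An -tx1 | tr -d ' \n'; echo; }

# \252 is the octal spelling of 0xaa.
before() { printf '123\252123\n' | hex_bytes; }
after()  { printf '123\252123\n' | LC_ALL=C sed 's#3.#SUBST#' | hex_bytes; }

before   # prints 313233aa3132330a
after    # prints 313253554253543132330a
```

In the second line the 33 aa pair is gone and 53 55 42 53 54 (SUBST) sits in its place, whatever the terminal chooses to show.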

jw5801 · Post by **jw5801** » Tue Jun 30, 2009 2:31 pm

tbart wrote:Ok, I think we're diving deep into *nix locale hell (or is it heaven?).

...

The commands you mentioned produce the same output here as well

Maybe the following works for you?
Code: Select all
echo -e "\x31\x32\x33\xaa\x31\x32\x33" | LC_ALL="POSIX" sed -e 's#3[\x00-\xFF]#SUBST#'
12SUBST123
It does here...

Yeah, the `POSIXLY_CORRECT' bit screws it up on my system as well. Not sure what it's meant to achieve? What happens without it?
