regex

shrndegruv · Post by **shrndegruv** » Thu Jul 09, 2009 4:55 pm

All,

What is a good regex to identify all ampersands in a string that are not part of a valid XML entity?

So the ampersand in

S&P 500

would be found, but

&amp;
&quot;
&apos;
&lt;
&gt;

would be ignored. In Java.

RageOfOrder · Post by **RageOfOrder** » Thu Jul 09, 2009 5:11 pm

shrndegruv wrote:
All,

What is a good regex to identify all ampersands in a string that are not part of a valid XML entity?

So the ampersand in

S&P 500

would be found, but

&amp;
&quot;
&apos;
&lt;
&gt;

would be ignored. In Java.

What exactly do you want it to return? Just the &'s? The words they are in? Or do you want to remove them?

Code:

s/\&(?!((amp)|(quot)|(apos)|(lt)|(gt));)//g;

would remove them just fine.

shrndegruv · Post by **shrndegruv** » Thu Jul 09, 2009 5:17 pm

I want to remove them.

I need that to be a valid Java regex, which is different than Perl.

I think I get what you are doing, but I don't get the "(?!" part.

RageOfOrder · Post by **RageOfOrder** » Thu Jul 09, 2009 5:24 pm

shrndegruv wrote:
I want to remove them.

I need that to be a valid Java regex, which is different than Perl.

I think I get what you are doing, but I don't get the "(?!" part.

OK, well, I've never done regex in Java, but I can't imagine regex differs all that much from language to language.

The regex I gave basically says

Code:

s/\&(?!((amp)|(quot)|(apos)|(lt)|(gt))\;)//g

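Look for an ampersand (\&) that is not immediately followed by amp; quot; apos; lt; or gt;
The (?! ... ) part is a negative lookahead: the match only succeeds if the text right after the ampersand does not match the expression inside it, which is a list of all the things I want to ignore. The lookahead consumes nothing itself, so only the ampersand gets replaced.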

That's about it.
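For the Java side, a minimal sketch of the same idea might look like the following. It assumes the input is a plain String; the class name StrayAmpersands and the helper removeStray are made up for illustration. The Perl s/...//g wrapper is dropped and only the pattern goes to java.util.regex, whose Pattern documentation lists (?!X) as a zero-width negative lookahead.

```java
import java.util.regex.Pattern;

public class StrayAmpersands {
    // An '&' that is NOT followed by one of the five predefined XML
    // entity names plus ';'. The (?!...) group is a zero-width negative
    // lookahead: it inspects the following text without consuming it.
    private static final Pattern STRAY_AMP =
            Pattern.compile("&(?!(?:amp|quot|apos|lt|gt);)");

    // Hypothetical helper: deletes every stray ampersand from the input.
    static String removeStray(String input) {
        return STRAY_AMP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeStray("S&P 500 &amp; AT&T &lt;b&gt;"));
    }
}
```

With this, removeStray("S&P 500 &amp; AT&T") gives "SP 500 &amp; ATT": the two bare ampersands are dropped and the entity is left alone.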

shrndegruv · Post by **shrndegruv** » Thu Jul 09, 2009 5:31 pm

I actually copied it verbatim and it worked, even though I couldn't find the ?! construct in the docs.

Sweet, thanks!

avx · Post by **avx** » Thu Jul 09, 2009 7:07 pm

Since all HTML entities are &*; you could most likely strip out the checks for amp, lt and the like and just replace the regex with &.*;

szczerb · Post by **szczerb** » Thu Jul 09, 2009 7:14 pm

ph030 wrote:Since all HTML entities are &*; you could most likely strip out the checks for amp, lt and the like and just replace the regex with &.*;

That would also match any string with &, then any amount of anything (except line breaks if the DOTALL flag is not set) and then a ;. So way too much stuff would get caught. Something more along the lines of "&[a-z]{1,4};" would be better for a more general (as in "open for new stuff that I can't hardcode cause I don't know it yet") expression.
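To see the difference, a small sketch can list what each pattern actually matches. The helper findAll and the sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityPatternDemo {
    // Hypothetical helper: collects every substring of input matched by regex.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "S&P 500 &lt; 5000; AT&T";
        // Greedy .* runs from the first '&' to the last ';' on the line.
        System.out.println(findAll("&.*;", sample));
        // Only lowercase names of one to four letters between '&' and ';'.
        System.out.println(findAll("&[a-z]{1,4};", sample));
    }
}
```

On this sample, &.*; produces a single match that starts at the ampersand in S&P and runs through the last semicolon, while &[a-z]{1,4}; picks out only &lt;.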

jho · Post by **jho** » Thu Jul 09, 2009 7:24 pm

szczerb wrote:
ph030 wrote:Since all HTML entities are &*; you could most likely strip out the checks for amp, lt and the like and just replace the regex with &.*;
That would also match any string with &, then any amount of anything (except line breaks if the DOTALL flag is not set) and then a ;. So way too much stuff would get caught. Something more along the lines of "&[a-z]{1,4};" would be better for a more general (as in "open for new stuff that I can't hardcode cause I don't know it yet") expression.

HTML entities can be at least five characters long after the ampersand (e.g. &raquo;), and can also contain # and digits (e.g. &#32; is a space). They can also be uppercase, I think.

So something along the lines of /&[\w\d#]{1,6};/ could be better.
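As a rough check of that pattern in Java, the sketch below compares it with a stricter variant. The stricter shape (a name starting with a letter, a decimal reference, or a hex reference) is an assumption of this example and not something taken from the thread; the class and helper names are hypothetical too. Note that \d is already covered by \w, so it is redundant but harmless.

```java
import java.util.regex.Pattern;

public class GeneralEntityCheck {
    // The pattern suggested above, written as a Java string literal.
    static final Pattern LOOSE = Pattern.compile("&[\\w\\d#]{1,6};");

    // Assumed stricter shape: a name, a decimal reference such as &#32;,
    // or a hex reference such as &#x20;. No length limit on the name.
    static final Pattern STRICT =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    // Hypothetical helper: true if the whole string s matches pattern p.
    static boolean looksLikeEntity(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"&raquo;", "&#32;", "&AMP;", "&#_#;", "&thetasym;"};
        for (String s : samples) {
            System.out.println(s + " loose=" + looksLikeEntity(LOOSE, s)
                    + " strict=" + looksLikeEntity(STRICT, s));
        }
    }
}
```

The loose pattern accepts &raquo;, &#32; and &AMP;, but it also accepts junk such as &#_#; and rejects longer names such as &thetasym; because of the six-character cap. The stricter variant behaves the other way round on those last two.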

szczerb · Post by **szczerb** » Thu Jul 09, 2009 7:28 pm

Haven't used html for a long time, so I just went with the examples given

avx · Post by **avx** » Thu Jul 09, 2009 7:46 pm

Oops, of course, one should make it less greedy, so yes, something like @jho's example is better.

desultory · Post by **desultory** » Fri Jul 10, 2009 7:46 am

Moved from Off the Wall to Portage & Programming, as this rather neatly fits the mandate.
