

What exactly do you want it to return? Just the &'s? The words they are in? Or do you want to remove them?
shrndegruv wrote:
all
What is a good regex to identify all ampersands in a string that are not part of a valid XML entity?
So the & in
S&P 500 would be found, but
&amp;
&quot;
&apos;
&lt;
&gt;
would be ignored. In Java.
Code:
s/\&(?!((amp)|(quot)|(apos)|(lt)|(gt));)//g; 

Ok, well, I've never done regex in Java, but I can't imagine regex differs all that much from language to language.
shrndegruv wrote:
I want to remove them.
I need that to be a valid Java regex, which is different than Perl.
I think I get what you are doing, but I don't get the "(?!" part.
Code:
s/\&(?!((amp)|(quot)|(apos)|(lt)|(gt))\;)//g
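
For what it's worth, here is a minimal sketch of how that Perl substitution might be written in Java. The class and method names (StrayAmpersands, strip) are made up for illustration and are not from anyone's post. The "(?!" part is a negative lookahead: the & only matches when the text right after it is not one of the listed entity names followed by a semicolon, and the lookahead itself consumes nothing.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names here are illustrative, not from the thread.
class StrayAmpersands {
    // '&' NOT followed by one of the five predefined XML entity names plus ';'.
    // (?!...) is a negative lookahead: it inspects what follows without consuming it.
    private static final Pattern STRAY =
            Pattern.compile("&(?!(?:amp|quot|apos|lt|gt);)");

    // Removes every ampersand that does not start one of those five entities.
    static String strip(String input) {
        return STRAY.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("S&P 500"));          // SP 500
        System.out.println(strip("a &amp; b &lt; c")); // a &amp; b &lt; c
    }
}
```

Note that in a Java string literal neither & nor ; needs a backslash, so the pattern is a bit shorter than the Perl one.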

That would also match any string with an &, then any amount of anything (except line breaks if the DOTALL flag is not set), and then a ;. So way too much stuff would get caught. Something more along the lines of "&[a-z]{1,4};" would be better for a more general expression (as in "open for new stuff that I can't hardcode because I don't know it yet").
ph030 wrote:
Since all HTML entities look like &*;, you could most likely strip out the checks for amp, lt and the like and just replace the regex with &.*;
HTML entity names can be five characters or longer after the ampersand (e.g. &raquo;, which renders as »), and entities can also contain # and digits (e.g. &#32; is a space). They can also be uppercase, I think.
szczerb wrote:
That would also match any string with &, then any amount of anything (except line breaks if the DOTALL flag is not set) and then a ;. So way too much stuff would get caught. Something more along the lines of "&[a-z]{1,4};" would be better for a more general (as in "open for new stuff that I can't hardcode cause I don't know it yet") expression.
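
To make the difference between these suggestions concrete, here is a rough Java sketch that compares the greedy &.*; with a tighter "entity-shaped" lookahead allowing longer names, uppercase letters, and numeric references. The class name EntityShapes, its methods, and the exact pattern are assumptions for illustration only. It checks the shape of what follows the ampersand, not whether the name is a real entity, so something like "AT&T;" would still be left alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; class, method, and pattern choices are illustrative.
class EntityShapes {
    // The greedy suggestion from the thread: '&', anything, ';'.
    static final Pattern GREEDY = Pattern.compile("&.*;");

    // Assumed "entity-shaped" tail: a name (a letter, then letters or digits),
    // a decimal reference (#32), or a hex reference (#x20), ending in ';'.
    static final String SHAPE =
            "(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);";

    // '&' that is NOT followed by something entity-shaped.
    static final Pattern STRAY = Pattern.compile("&(?!" + SHAPE + ")");

    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Removes ampersands that do not start something entity-shaped.
    static String stripStray(String input) {
        return STRAY.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "AT&T rocks; 1 &lt; 2 &raquo; &#32;";
        // Greedy: runs from the first '&' all the way to the last ';'.
        System.out.println(firstMatch(GREEDY, text));
        // Tighter: only the stray '&' in "AT&T" is removed.
        System.out.println(stripStray(text));
    }
}
```

If only the five XML entities should survive, the hardcoded list from earlier in the thread is the stricter choice; the shape-based version trades that strictness for not having to list every entity.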