regex

Post by shrndegruv » Thu Jul 09, 2009 4:55 pm

All,

What is a good regex to identify all ampersands in a string that are not part of a valid XML entity?

So

S&P 500 would be found, but

&amp;
&quot;
&apos;
&lt;
&gt;

would be ignored. In Java.
long is the way, and hard, that out of hell leads up to light...
Post by RageOfOrder » Thu Jul 09, 2009 5:11 pm

shrndegruv wrote:All,

What is a good regex to identify all ampersands in a string that are not part of a valid XML entity?

So

S&P 500 would be found, but

&amp;
&quot;
&apos;
&lt;
&gt;

would be ignored. In Java.
What exactly do you want it to return? Just the &'s? The words they are in? Or do you want them removed?

Code: Select all

s/\&(?!((amp)|(quot)|(apos)|(lt)|(gt));)//g;  
would remove them just fine.
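
Since the question asks for Java: the s/…//g wrapper is Perl substitution syntax, so in Java only the pattern between the slashes would be handed to the regex API. Here is a minimal sketch, assuming the standard java.util.regex package; the class name StrayAmpersands and the method strip are made up for illustration, and the inner capturing groups are swapped for one non-capturing group since nothing reads them.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class StrayAmpersands {
    // An ampersand that is NOT followed by one of the five predefined
    // XML entity names plus a semicolon.
    private static final Pattern STRAY_AMP =
            Pattern.compile("&(?!(?:amp|quot|apos|lt|gt);)");

    // Removes every stray ampersand and leaves the entities alone.
    public static String strip(String input) {
        return STRAY_AMP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: SP 500 &amp; ATT say &quot;hi&quot;
        System.out.println(strip("S&P 500 &amp; AT&T say &quot;hi&quot;"));
    }
}
```

If the goal were to keep the text valid XML rather than drop characters, passing "&amp;" as the replacement instead of "" would escape the stray ampersands.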
...And then stuff happened.
Post by shrndegruv » Thu Jul 09, 2009 5:17 pm

I want to remove them.

I need that to be a valid Java regex, which is different than Perl.

I think I get what you are doing, but I don't get the "(?!" part.
Post by RageOfOrder » Thu Jul 09, 2009 5:24 pm

shrndegruv wrote:I want to remove them.

I need that to be a valid Java regex, which is different than Perl.

I think I get what you are doing, but I don't get the "(?!" part.
OK, well, I've never done regex in Java, but I can't imagine the syntax differs all that much from language to language.

The regex I gave basically says:

Code: Select all

s/\&(?!((amp)|(quot)|(apos)|(lt)|(gt))\;)//g
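Look for an ampersand (\&) that is not immediately followed by amp; quot; apos; lt; or gt;
The (?! ... ) group is a negative lookahead: it succeeds only if the text right after the ampersand does not match the expression inside it, and it consumes no characters itself. The expression inside is the list of all the things I want to ignore.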

That's about it.
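
To make the "(?!" part a bit more concrete, here is a small Java sketch (assuming java.util.regex; the class LookaheadDemo and the method strayPositions are invented names) that lists where the pattern matches. The point it tries to show is that each match is only the ampersand itself, because the lookahead checks the following text without consuming it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class LookaheadDemo {
    private static final Pattern STRAY_AMP =
            Pattern.compile("&(?!(?:amp|quot|apos|lt|gt);)");

    // Returns the index of every ampersand that is not the start of
    // one of the five predefined XML entities.
    public static List<Integer> strayPositions(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = STRAY_AMP.matcher(input);
        while (m.find()) {
            // The lookahead is zero-width, so m.group() is just "&".
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // prints: [1, 10]
        System.out.println(strayPositions("a&b &lt; c&d"));
    }
}
```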
Post by shrndegruv » Thu Jul 09, 2009 5:31 pm

I actually copied it verbatim and it worked, even though I couldn't find the ?! construct in the docs.

Sweet, thanks!
Post by avx » Thu Jul 09, 2009 7:07 pm

Since all HTML entities have the form &*; you could most probably strip out the checks for amp, lt and the like and just replace the regex with &.*;
Post by szczerb » Thu Jul 09, 2009 7:14 pm

ph030 wrote:Since all HTML entities have the form &*; you could most probably strip out the checks for amp, lt and the like and just replace the regex with &.*;
That would also match any string with &, then any amount of anything (except line breaks if the DOTALL flag is not set) and then a ;. So way too much stuff would get caught. Something more along the lines of "&[a-z]{1,4};" would be better for a more general (as in "open for new stuff that I can't hardcode cause I don't know it yet") expression.
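
A quick way to see the difference in Java (a sketch assuming java.util.regex; GreedyDemo and firstMatch are names made up for this example) is to print the first match of each pattern on the same input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class GreedyDemo {
    // Returns the first substring matched by regex, or null if none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "S&P rose; x &amp; y";
        // Greedy .* runs from the first '&' to the LAST ';'.
        // prints: &P rose; x &amp;
        System.out.println(firstMatch("&.*;", text));
        // The bounded class only accepts 1 to 4 lowercase letters.
        // prints: &amp;
        System.out.println(firstMatch("&[a-z]{1,4};", text));
    }
}
```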
Post by jho » Thu Jul 09, 2009 7:24 pm

szczerb wrote:
ph030 wrote:Since all HTML entities have the form &*; you could most probably strip out the checks for amp, lt and the like and just replace the regex with &.*;
That would also match any string with &, then any amount of anything (except line breaks if the DOTALL flag is not set) and then a ;. So way too much stuff would get caught. Something more along the lines of "&[a-z]{1,4};" would be better for a more general (as in "open for new stuff that I can't hardcode cause I don't know it yet") expression.
HTML entities can be at least five characters after the ampersand (e.g. &raquo;), and also contain # and numbers (e.g. &#32; is a space). They can also be uppercase, I think.

So something along the lines of /&[\w\d#]{1,6};/ could be better.
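
Combining that idea with the negative lookahead from earlier in the thread, a Java sketch for removing only the ampersands that do not start something entity-shaped might look like this. The pattern is my own assumption, not the one above: it accepts a name starting with a letter, a decimal reference like &#32;, or a hex reference like &#x20;, with no length cap. It only checks the shape, not whether the entity is actually defined. The class and method names are invented.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class EntityShapedAmpersands {
    // '&' not followed by name; or #digits; or #xhexdigits;
    private static final Pattern STRAY_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Removes ampersands that do not begin an entity-shaped token.
    public static String strip(String input) {
        return STRAY_AMP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: SP &raquo; &#32; &#x20; ATT
        System.out.println(strip("S&P &raquo; &#32; &#x20; AT&T"));
        // Shape check only: "Q&A;" looks like an entity, so it is kept.
        // prints: Q&A;
        System.out.println(strip("Q&A;"));
    }
}
```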
"I'm sorry, I only accept criticism in the form of sed expressions."
Post by szczerb » Thu Jul 09, 2009 7:28 pm

Haven't used html for a long time, so I just went with the examples given :)
Post by avx » Thu Jul 09, 2009 7:46 pm

Oops, of course, one should make it less greedy, so yes, something like @jho's example is better.
Post by desultory » Fri Jul 10, 2009 7:46 am

Moved from Off the Wall to Portage & Programming, as this rather neatly fits the mandate.