grepping several values from a html-page?

Post by Strowi » Wed Jan 06, 2010 12:01 pm

Hi,

I am trying to pull several values from a web page into a bash script.
Below is the part of the <table> that holds the values I need.
The first row is static; it is followed by one or more rows, and I want the content of their table-data fields.
Does anyone have a clue what the best approach would be?

Code: Select all

<td width="116" bgcolor="#FFCC00"><b>Status</b></td> 
</tr> 
<tr valign="top"> 
<td width="57" bgcolor="#FFFFFF"><img border="0" height="1" width="57" src="../art/pix_white.gif"></td>
<td width="117" bgcolor="">Info. Abgangs-Auswechslungsstelle&nbsp;</td>
<td width="117" bgcolor="">2010-01-06&nbsp;</td><td width="116" bgcolor="">allgemeiner Status.</td> 
</tr> 
</table> 
--
Linux & such ...
http://blog.hasnoname.de

Post by Naib » Wed Jan 06, 2010 12:08 pm

Don't use grep.
#define HelloWorld int
#define Int main()
#define Return printf
#define Print return
#include <stdio>
HelloWorld Int {
Return("Hello, world!\n");
Print 0;

Post by cach0rr0 » Wed Jan 06, 2010 12:23 pm

Personally, I'd do the whole thing in Perl if possible.

Maybe you could run a few lines of sed over it? I still think Perl would handle it more gracefully, since you could easily load the whole page into a scalar, remove everything but the TD tags, then remove everything but what's between the remaining tags.

If that makes sense.
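
A minimal sketch of that idea, written in Python rather than Perl (the language choice and the name `td_texts` are assumptions, not something from this thread). It loads the fragment from the first post into one string, keeps only what sits between the `<td>` tags, then strips whatever tags remain inside each cell. Being a regex over markup, it is fragile and will likely break if the page layout changes.

```python
import re
from html import unescape

# Fragment copied from the first post; a real script would read the full page.
PAGE = """<td width="116" bgcolor="#FFCC00"><b>Status</b></td>
</tr>
<tr valign="top">
<td width="57" bgcolor="#FFFFFF"><img border="0" height="1" width="57" src="../art/pix_white.gif"></td>
<td width="117" bgcolor="">Info. Abgangs-Auswechslungsstelle&nbsp;</td>
<td width="117" bgcolor="">2010-01-06&nbsp;</td><td width="116" bgcolor="">allgemeiner Status.</td>
</tr>
</table>"""


def td_texts(page):
    """Return the text of every <td> cell, with inner tags stripped."""
    # Step 1: keep only what sits between <td ...> and </td>.
    cells = re.findall(r"<td\b[^>]*>(.*?)</td>", page, flags=re.S | re.I)
    # Step 2: drop tags left inside a cell, decode entities, trim whitespace.
    cleaned = []
    for cell in cells:
        text = unescape(re.sub(r"<[^>]+>", "", cell))
        cleaned.append(text.replace("\xa0", " ").strip())
    return cleaned


values = td_texts(PAGE)
print(values)
```

The spacer cell that only holds an image comes back as an empty string, so a caller would probably want to filter those out.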

Post by durian » Wed Jan 06, 2010 12:39 pm

cach0rr0 wrote: maybe you could run a few lines of sed over it? still think perl would handle it more gracefully, since you could easily load the whole page into a scalar, remove all but TD tags, then remove all but what's between remaining tags
Doesn't Perl even have decent HTML parsers? That would make it even better; a hardcoded regex based on the contents could easily fail if they change a single byte in the file...

-peter
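
For comparison, a parser-based sketch. It uses Python's standard-library `html.parser` instead of a Perl module, which is a swapped-in choice; the class name `TdText` is hypothetical. The parser follows the tag structure, so attribute changes such as a different `width` should not affect the result.

```python
from html.parser import HTMLParser

# Fragment copied from the first post.
FRAGMENT = """<td width="116" bgcolor="#FFCC00"><b>Status</b></td>
</tr>
<tr valign="top">
<td width="57" bgcolor="#FFFFFF"><img border="0" height="1" width="57" src="../art/pix_white.gif"></td>
<td width="117" bgcolor="">Info. Abgangs-Auswechslungsstelle&nbsp;</td>
<td width="117" bgcolor="">2010-01-06&nbsp;</td><td width="116" bgcolor="">allgemeiner Status.</td>
</tr>
</table>"""


class TdText(HTMLParser):
    """Collect the text content of every <td> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []    # one string per finished <td>
        self._buf = None   # text pieces of the <td> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buf = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            # &nbsp; arrives as U+00A0; make it a plain space and trim.
            self.cells.append("".join(self._buf).replace("\xa0", " ").strip())
            self._buf = None


parser = TdText()
parser.feed(FRAGMENT)
parser.close()
print(parser.cells)
```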

Post by Naib » Wed Jan 06, 2010 12:53 pm

durian wrote: Doesn't perl even have decent HTML parsers? That would make it even better, a hardcoded regex based on the contents could easily fail if they change a single byte in the file...
Exactly.
Don't regex over a markup language.

Post by cach0rr0 » Wed Jan 06, 2010 1:10 pm

durian wrote: Doesn't perl even have decent HTML parsers? That would make it even better, a hardcoded regex based on the contents could easily fail if they change a single byte in the file...
They do. I'm just a bit hesitant to proffer heaps of info in case this is, say, for a school project or some such.
I don't know what his requirements are.

Post by Strowi » Wed Jan 06, 2010 1:34 pm

Whoa, lots of replies in such a short time. ;)

No, it's not a school project; I need to grab some data from one of our web servers at the university. I am not that familiar with Perl, so I guess I will have to dig into that...
I already tried putting it through tiny and running awk/sed over the resulting XML. I can even identify the td classes, as tiny enumerates them (cXX), but how would I check whether there is more than one entry?

greetings,
strowi
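
One possible way to handle the "more than one entry" case is to group cells by row and count the rows that follow the static header. The sketch below does that with Python's standard-library `html.parser`. Everything named here is an assumption: the class `RowCollector`, the helper `data_rows`, the wrapping `<table>`/`<tr>` tags around the header cell, and the second data row, which is made up purely to show two entries.

```python
from html.parser import HTMLParser

# Header cell and first data row are from the first post. The enclosing
# <table>/<tr> tags and the whole second data row are made up for the demo.
SAMPLE = """<table>
<tr>
<td width="116" bgcolor="#FFCC00"><b>Status</b></td>
</tr>
<tr valign="top">
<td width="57" bgcolor="#FFFFFF"><img border="0" height="1" width="57" src="../art/pix_white.gif"></td>
<td width="117" bgcolor="">Info. Abgangs-Auswechslungsstelle&nbsp;</td>
<td width="117" bgcolor="">2010-01-06&nbsp;</td><td width="116" bgcolor="">allgemeiner Status.</td>
</tr>
<tr valign="top">
<td width="57" bgcolor="#FFFFFF"></td>
<td width="117" bgcolor="">Hypothetical second entry&nbsp;</td>
<td width="117" bgcolor="">2010-01-07&nbsp;</td><td width="116" bgcolor="">made-up status</td>
</tr>
</table>"""


class RowCollector(HTMLParser):
    """Collect <td> texts grouped by their enclosing <tr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, else None
        self._cell = None  # text pieces of the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).replace("\xa0", " ").strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard a cell left open by broken markup


def data_rows(page):
    """Return the non-empty cells of every row after the header row."""
    parser = RowCollector()
    parser.feed(page)
    parser.close()
    # Skip the static header row and drop the empty spacer cell of each row.
    return [[c for c in row if c] for row in parser.rows[1:]]


entries = data_rows(SAMPLE)
print(len(entries), "entries")
for entry in entries:
    print(entry)
```

The length of the returned list is the number of entries, so a script can loop over it whether the page holds one row or several.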

Post by mokona » Thu Jan 07, 2010 3:10 pm

Try "lynx -dump".
Then read each column with awk.

Post by Mike Hunt » Thu Jan 07, 2010 10:42 pm

You could try this:

Code: Select all

sed -n '/Status/,/\/table/p' filename.html

Post by pjp » Fri Jan 08, 2010 4:45 pm

Moved from Unsupported Software to Portage & Programming.
Quis separabit? Quo animo?

Post by Naib » Fri Jan 08, 2010 5:04 pm

mokona wrote: Try "lynx -dump".
Then read each column with awk.
For such things I prefer curl or w3m.
Then, depending on how complex the output is, use either basic console commands (cut, ...), regex tools (grep, sed), or another language (Python, Perl, ...).

Post by marens » Fri Jan 08, 2010 10:04 pm

I'd say xpath is the way to go.

http://www.w3schools.com/XPath/xpath_syntax.asp

Combine with your favourite programming language.
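
A small XPath sketch in Python, using the standard-library `xml.etree.ElementTree`, which supports only a limited subset of XPath. It assumes the page has already been cleaned into well-formed XML (for example by an HTML-to-XHTML tool), so the sample below closes the `<img>` tag, wraps the header cell in `<table>`/`<tr>`, and writes `&nbsp;` as `&#160;`. Those changes to the fragment from the first post are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed input: the fragment from the first post after cleanup into
# well-formed XML (closed <img />, numeric &#160; instead of &nbsp;).
XHTML = """<table>
<tr>
<td width="116" bgcolor="#FFCC00"><b>Status</b></td>
</tr>
<tr valign="top">
<td width="57" bgcolor="#FFFFFF"><img border="0" height="1" width="57" src="../art/pix_white.gif" /></td>
<td width="117" bgcolor="">Info. Abgangs-Auswechslungsstelle&#160;</td>
<td width="117" bgcolor="">2010-01-06&#160;</td><td width="116" bgcolor="">allgemeiner Status.</td>
</tr>
</table>"""

root = ET.fromstring(XHTML)

# Select the data rows: in this sample only they carry valign="top".
rows = []
for tr in root.findall(".//tr[@valign='top']"):
    cells = []
    for td in tr.findall("td"):
        # itertext() also picks up text nested in child tags such as <b>.
        text = "".join(td.itertext()).replace("\xa0", " ").strip()
        if text:  # skip the empty spacer cell
            cells.append(text)
    rows.append(cells)

print(rows)
```

Selecting on `valign="top"` is just one way to tell data rows from the header in this sample; a real page may need a different predicate.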
If English was good enough for Jesus, then it's good enough for you!

Post by Mike Hunt » Mon Jan 11, 2010 3:10 pm

This works better; it prints only up to the first occurrence of the closing table tag and then exits:

Code: Select all

sed -n '/Status/,/<\/table>/{p;/<\/table>/q;}' file