Gentoo Forums
Gentoo Forums Forum Index » Off the Wall
simcop2387
Apprentice


Joined: 14 Aug 2002
Posts: 200
Location: Galactic Sector ZZ9 Plural Z Alpha

Posted: Sat Sep 07, 2002 6:31 pm    Post subject: stripping out all the <a href="..."> tags

OK, I need to strip everything out of an HTML document (all the tags and all the other text) and leave just the links: <a href="link1.html"><a href="link2.html">, etc. Any ideas? I plan on using wget to download them all. I can easily strip off the '<a href="' and '">' parts, since they have a fixed form, but the HTML has no pattern to where the links appear. I was thinking that sed, gawk, or something like that with a regexp could work, but I'm a n00b with regexps.
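A crude sketch of the regexp approach described above, assuming every link uses double quotes and has href as its first attribute (the sample file name is hypothetical):

```shell
# Hypothetical sample page to test against
cat > page.html <<'EOF'
<html><body>
Some text <a href="link1.html">one</a> and <a href="link2.html">two</a>.
</body></html>
EOF

# Pull out each <a href="..."> and strip the surrounding markup.
# This breaks on single quotes, extra attributes before href, or
# tags split across lines -- hence the HTML-parser suggestions below.
grep -o '<a href="[^"]*"' page.html | sed 's/^<a href="//; s/"$//'
```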
kirill
Apprentice


Joined: 01 Aug 2002
Posts: 183
Location: Finland

Posted: Sat Sep 07, 2002 7:21 pm    Post subject: Re: stripping out all the <a href="..."> tags

simcop2387 wrote:
OK, I need to strip everything out of an HTML document (all the tags and all the other text) and leave just the links: <a href="link1.html"><a href="link2.html">, etc. Any ideas? I plan on using wget to download them all. I can easily strip off the '<a href="' and '">' parts, since they have a fixed form, but the HTML has no pattern to where the links appear. I was thinking that sed, gawk, or something like that with a regexp could work, but I'm a n00b with regexps.


This is not really an answer to your question, but since you are going to use wget anyway, you could just recursively mirror the page (wget -r).

If there is any similarity among the files/pages you want (*.jpg, *.gif, *.mpg), you can tell wget to fetch only those files (-A / --accept=LIST) or to skip them (-R / --reject=LIST).

Wget is a pretty powerful tool; there is really no need to hack the HTML code to get it working with wget ;)
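For instance (the URL is hypothetical; the flags are the -r, -l, -A, and -R options mentioned above, which require network access to actually run):

```shell
# Recursively fetch up to two levels deep, keeping only images
wget -r -l 2 -A 'jpg,gif,png' http://www.example.com/

# Same crawl, but rejecting archive files instead
wget -r -R 'zip,tar,gz' http://www.example.com/
```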
_________________
--kirill
rac
Bodhisattva


Joined: 30 May 2002
Posts: 6553
Location: Japanifornia

Posted: Mon Sep 09, 2002 3:09 am

HTML is very hard to process with straight regular expressions. If you have not found another solution to this problem yet, here's a lightly tested one:
Code:
#!/usr/bin/perl -w

use strict;
use Carp;
use HTML::Parser;

# Event-driven parse: call handle_a for each start tag, passing only
# the tag's attributes as a hash reference (the 'attr' argspec).
my $parser = HTML::Parser->new( api_version => 3 );
$parser->handler( start => \&handle_a, 'attr' );
$parser->report_tags( 'a' );    # ignore every tag except <a>

for my $fn ( @ARGV ) {
   $parser->parse_file( $fn ) or croak( "$! on $fn" );
}

# Print the href attribute, one URL per line; skip anchors without one.
sub handle_a {
   my ($attrs) = @_;
   exists( $attrs->{href} ) and print "$attrs->{href}\n";
}

Given one or more filenames as arguments on the command line, it will print out a list of URLs that appear in links, one to a line. Modifying this to retrieve documents from a URL would be very simple.
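For example, assuming the script above is saved as extract-links.pl (the file names here are hypothetical), its output feeds straight into wget:

```shell
# Collect every link from one or more saved pages...
perl extract-links.pl index.html > urls.txt

# ...then let wget fetch them all (requires network access)
wget -i urls.txt
```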
_________________
For every higher wall, there is a taller ladder
ghost_o
Tux's lil' helper


Joined: 10 Jul 2002
Posts: 119

Posted: Mon Sep 09, 2002 4:16 am

I agree: HTML::Parser and HTML2txt work well together. I just wrote a script to run 4500+ search-engine queries and do exactly that, using a mix of those two and some regexps.

-G