Gentoo Forums
Archiving..
Bague
Apprentice


Joined: 09 Feb 2004
Posts: 292

Posted: Sat Aug 28, 2004 8:47 pm    Post subject: Archiving..

Well, due to certain things I wish to do, I am looking into a way of archiving several websites weekly and having them fully usable on my computer.

I know I can use wget -m (sitename) with -w (wait time; I don't like to slam people's servers), but there are still a few things I wonder about.

1) I have no clue how to work with crontabs (I use vixie-cron). I believe it uses Bash scripting, which I know none of. Is there a simple way of executing the wget ... command weekly?

2) Some of the websites I want to archive have content (for instance posts or backgrounds) that is for members only, and I am a member of those sites. Yet those pages are (for obvious reasons) covered by robots.txt. Is there a way to make wget know that I am a member so it will archive the members-only content?

I want to do this for several different reasons, such as nostalgia: I have several friends that have websites, and it would be cool to bring them up 10-20 years from now. Also, I am sure I am not the only one who has seen a great site go under due to lack of money, etc.
tln
Veteran


Joined: 24 Sep 2003
Posts: 1501

Posted: Sat Aug 28, 2004 8:57 pm    Post subject:

Vixie-cron uses whatever shell you set it to use with the SHELL variable in your crontab.

Just executing the wget command each week is easy to do.

EDIT: forgot the code lol

Code:
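0 0 * * 0 wget -whatever your params were

This runs the wget command once every Sunday, at midnight. The first two fields (minute and hour) must be set; leaving them as * would run the command every minute of the day.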

Use crontab -e to edit your user's crontab. You must be a member of the cron group.
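
As a rough sketch of how a crontab line is put together: the five schedule fields are minute, hour, day of month, month and day of week, followed by the command. The helper name weekly_cron_line and the script path are made up for illustration.

```shell
#!/bin/sh
# Build a crontab line that fires once a week, on Sunday (day-of-week 0).
# Field order: minute hour day-of-month month day-of-week command
# weekly_cron_line is a hypothetical helper, not part of cron.
weekly_cron_line() {
    minute="$1"
    hour="$2"
    shift 2
    printf '%s %s * * 0 %s\n' "$minute" "$hour" "$*"
}

# Example with a hypothetical script path: 03:30 every Sunday.
weekly_cron_line 30 3 "$HOME/bin/mirror.sh"
```

The printed line can be pasted into the editor that crontab -e opens.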
Bague
Apprentice


Joined: 09 Feb 2004
Posts: 292

Posted: Sat Aug 28, 2004 9:15 pm    Post subject:

Thanks for the info, one question though, what happens if your computer isn't on at all on the day the crontab is listed? Also, what if that user is not logged in but another is? Lastly, is there a way to make it wget to a specific directory?
denstark
l33t


Joined: 02 Jun 2003
Posts: 654
Location: sd.ca.us

Posted: Sat Aug 28, 2004 9:19 pm    Post subject:

Simply make a shell script that uses wget, then set the crontab to run it once per week... take a look at this link for more information: http://www.adminschoice.com/docs/crontab.htm

And this link for shell scripting (it's easy as hell): http://gd.tuwien.ac.at/linuxcommand.org/writing_shell_scripts.html#contents
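
A minimal sketch of such a script, assuming the sites are kept one URL per line in a plain-text list file. The function names, the list format and the WGET variable are assumptions made for illustration, and the wget flags are only a polite starting point.

```shell
#!/bin/sh
# Hypothetical wrapper: mirror every URL listed in a plain-text file.
# WGET can be overridden (for example WGET=echo) to preview the commands.
WGET="${WGET:-wget}"

# Print the URLs from a list file, skipping blank lines and '#' comments.
list_sites() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Mirror each listed site under the destination directory,
# waiting 2 seconds between requests to go easy on the servers.
mirror_all() {
    list_file="$1"
    dest="$2"
    list_sites "$list_file" | while read -r url; do
        "$WGET" --mirror --wait=2 --directory-prefix="$dest" "$url"
    done
}
```

A cron job would then source this file (or have it end with a call) and run something like mirror_all "$HOME/sites.txt" "$HOME/mirror", where both paths are placeholders.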




Den
tln
Veteran


Joined: 24 Sep 2003
Posts: 1501

Posted: Sat Aug 28, 2004 9:26 pm    Post subject:

Note that some systems use 0-6 for week days, while others use 1-7. I don't know what Gentoo uses though, so my example might be wrong.
thunderlove
Tux's lil' helper


Joined: 24 Aug 2004
Posts: 76
Location: Sitting on a stool somewhere in southern oregon

Posted: Sun Aug 29, 2004 1:14 pm    Post subject:

Bague wrote:
Thanks for the info, one question though, what happens if your computer isn't on at all on the day the crontab is listed? Also, what if that user is not logged in but another is? Lastly, is there a way to make it wget to a specific directory?


Actually, the computer will run the cronjobs next chance it gets -- the sys-apps/cronbase package adds a script /usr/sbin/run-crons to take care of that.
thunderlove
Tux's lil' helper


Joined: 24 Aug 2004
Posts: 76
Location: Sitting on a stool somewhere in southern oregon

Posted: Sun Aug 29, 2004 1:55 pm    Post subject: Re: Archiving..

Bague wrote:
2) Some of the websites I want to archive have content (for instance posts or backgrounds) that is for members only, and I am a member of those sites. Yet those pages are (for obvious reasons) covered by robots.txt. Is there a way to make wget know that I am a member so it will archive the members-only content?


Edit /etc/wget/wgetrc to ignore robots.txt (set robots = off). The other options in the file can also be set via the command line.

Open two terminals, run 'wget --help | less' in one and your editor in the other, and start adding command-line options.

Your final script might look something like this:

Code:
#!/bin/sh
# Usage: <script> URL [extra wget options...]
# The first argument is the page to fetch; the rest are passed to wget as-is.
WEBPAGE="$1"
shift
wget "$@" \
        --tries=20 --timestamping --timeout=120 --waitretry=30 --random-wait \
        --force-directories --directory-prefix=/var/cache/mirror \
        --html-extension --user-agent "Mozilla/5.0 (Linux; en)" \
        --page-requisites "$WEBPAGE"


To run this, pass the desired webpage as the first parameter, followed by other site-specific options.

Test it (carefully!), refine it, and you're done!

For sites with authentication, use 'http://user:password@host.domain/blah/blah'

There are options to recursively grab all the pages on the website (--recursive; if you just want the whole site, --mirror has good settings), limit wget's bandwidth usage (--limit-rate=rate), and even load and save cookies (--load-cookies and --save-cookies; just point it to your Mozilla/Firefox cookies.txt).

'info wget' will get you more documentation than you can shake a live penguin at!

(Make sure you read the section on --no-clobber in the info pages. You would probably want to use it if you are using the --recursive [-r] option, but NOT if you are also using --timestamping [-N]; wget refuses to combine the two.)
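
Putting the members-only pieces together, here is one possible sketch, assuming you are already logged in through the browser and its cookies are available in cookies.txt format. The function name, the WGET and COOKIES variables, the cookie path and the 50k rate limit are all assumptions; adjust them to your setup.

```shell
#!/bin/sh
# Sketch: polite mirror of a members-only site using an existing browser session.
# WGET, COOKIES and mirror_members_site are illustrative names, not from wget itself.
WGET="${WGET:-wget}"
COOKIES="${COOKIES:-$HOME/.mozilla/cookies.txt}"   # hypothetical path; use your profile's file

mirror_members_site() {
    url="$1"
    dest="$2"
    # --mirror already turns on recursion and timestamping;
    # -e robots=off has the same effect as "robots = off" in wgetrc.
    "$WGET" --mirror --page-requisites --convert-links \
        --wait=2 --random-wait --limit-rate=50k \
        --load-cookies "$COOKIES" \
        -e robots=off \
        --directory-prefix="$dest" "$url"
}
```

Running it with WGET=echo first prints the full command line without touching the network, which is a cheap way to check the options before pointing it at a real site.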