Is it possible to download the forum? For offline reading

Post by vitaly-zdanevich » Tue Jul 09, 2024 10:22 am

Hi.
Post by Banana » Tue Jul 09, 2024 6:19 pm

No.

You have to consult your favourite search engine to find a tool that can crawl websites and store the pages as HTML files.
Post by Hu » Tue Jul 09, 2024 9:56 pm

Before trying this, review very carefully what the tool will download. The forums are massive, and a careless attempt to spider the forums will download far more than you want, as well as likely putting enough load on the server to irritate the host.
Post by szatox » Tue Jul 09, 2024 10:13 pm

robots.txt
wget's mirror mode respects robots.txt, so it should be fine if you put a speed limit on it. If you don't, you will certainly irritate the infra guys.

Any idea why it bans magpie-crawler though? It's a funny way to ban a spider too; if one wants to misbehave, it can start with simply ignoring the kind request anyway.
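To make the robots.txt point concrete, here is a minimal sketch of how a well-behaved crawler decides whether it may fetch a URL, using Python's standard `urllib.robotparser`. The robots.txt text, the host `forums.example.org` and the paths are all made up for illustration; they only mimic the two-part shape discussed here (rules for everyone, then a ban on one named crawler) and are not the forum's real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, NOT the forum's real file. It only mimics the
# two-part layout described in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search.php

User-agent: magpie-crawler
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Return True if the rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    base = "https://forums.example.org"  # placeholder host
    for agent in ("Wget", "magpie-crawler"):
        for path in ("/viewtopic.php?t=1", "/search.php?keywords=x"):
            print(agent, path, allowed(parser, agent, base + path))
```

As noted above, this only helps with crawlers that choose to ask; nothing enforces the answer. Throttling is a separate matter, and in wget the usual knobs for that are `--wait` and `--limit-rate`.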
Post by Hu » Wed Jul 10, 2024 12:20 am

I was not involved in the creation of that file, so I cannot comment on magpie. However, having looked at it, I don't think that exclusion is sufficient to address my earlier warning. Once a spider finds a topic, each topic has within it links to the individual posts of that thread, so a 25-post topic will be downloaded 26 times: once as a topic, and once as each of its 25 posts. As the number of posts increases, the problem gets worse. Those robots.txt exclusions might keep a crawler from getting topic index lists, but if the user provides a link to a topic, the crawler can hop from thread to thread using the next/previous links.
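The arithmetic can be sketched in a few lines of Python. The function names are made up for illustration, and the model is deliberately simple: it assumes each topic fits on one page and that the crawler follows every per-post link exactly once.

```python
def fetches_for_topic(posts_in_topic: int) -> int:
    """Requests made for one topic: the topic page plus one per post link.

    Simplifying assumption: the whole topic fits on a single page.
    """
    if posts_in_topic < 0:
        raise ValueError("a topic cannot have a negative number of posts")
    return 1 + posts_in_topic


def total_fetches(topic_sizes: list[int]) -> int:
    """Requests made for several topics under the same assumption."""
    return sum(fetches_for_topic(size) for size in topic_sizes)


if __name__ == "__main__":
    print(fetches_for_topic(25))        # 26, the example from the post
    print(total_fetches([25, 3, 100]))  # 131 requests for only 3 topics
```

Under this model the number of requests tracks the number of posts rather than the number of topics, which is why the load grows so quickly on a large board.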
Post by szatox » Wed Jul 10, 2024 12:54 am

No, I think the first part of robots.txt (for all user agents) is a good instruction for well-behaved crawlers, since it allows indexing of content with permanent addresses while filtering out the volatile queries.
Links to individual posts shouldn't be a problem either, since the # part is only a reference to the ID attribute of an HTML element within the same document. Clicking on those links doesn't reload the page either; the browser just jumps to that position. Unless there are some other links to posts I'm not aware of?

It's the second part, for magpie, that baffles me.
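The fragment argument can be illustrated with the standard `urllib.parse.urldefrag`: the part after `#` is never sent to the server, so URLs that differ only in their fragment name the same document. This is a minimal sketch; the host and the `t=`/`#p…` URL shapes are placeholders chosen for illustration, not taken from the forum.

```python
from urllib.parse import urldefrag


def document_url(url: str) -> str:
    """Drop the #fragment, leaving the URL that is actually requested."""
    return urldefrag(url).url


def unique_documents(urls: list[str]) -> list[str]:
    """Collapse links that differ only by fragment, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        doc = document_url(url)
        if doc not in seen:
            seen.add(doc)
            result.append(doc)
    return result


if __name__ == "__main__":
    links = [
        "https://forums.example.org/viewtopic.php?t=1#p10",
        "https://forums.example.org/viewtopic.php?t=1#p11",
        "https://forums.example.org/viewtopic.php?t=1",
    ]
    # All three links point at one document.
    print(unique_documents(links))
    # ['https://forums.example.org/viewtopic.php?t=1']
```

This only covers links whose path and query are identical; links that differ before the `#` remain distinct documents.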
Post by Hu » Wed Jul 10, 2024 2:07 am

When viewing the thread, for each post, there is an icon that links to [post]8833110[/post], [post]8833113[/post], etc. Those are all separate URLs from the thread itself, although their content is largely duplicative. You are right that those also have a fragment to jump the browser to the specific post, but my concern is that a crawler that is trying to mirror everything would retrieve each of those [post] links individually.
Post by szatox » Wed Jul 10, 2024 10:10 am

Wow, you got me on this one. It actually uses the post ID in the path followed by the post ID of the element, instead of the thread and page in the path followed by the post ID of the element. OK, now that IS actually a problem.
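One way a mirroring script could sidestep this is to skip per-post permalinks before queueing links. The sketch below assumes, and this is only an assumption, that topic pages look like `viewtopic.php?t=<topic id>` and per-post links look like `viewtopic.php?p=<post id>#p<post id>`; the host is a placeholder and the function names are hypothetical. The real URL scheme should be checked before relying on anything like this.

```python
from urllib.parse import parse_qs, urldefrag, urlsplit


def is_post_permalink(url: str) -> bool:
    """True for links that address a single post rather than a topic page.

    Assumed URL shapes (not verified against the forum):
      topic page:     viewtopic.php?t=<topic id>
      post permalink: viewtopic.php?p=<post id>#p<post id>
    """
    query = parse_qs(urlsplit(url).query)
    return "p" in query and "t" not in query


def crawl_targets(links: list[str]) -> list[str]:
    """Drop post permalinks, strip fragments, and de-duplicate in order."""
    seen = set()
    targets = []
    for link in links:
        if is_post_permalink(link):
            continue
        page = urldefrag(link).url
        if page not in seen:
            seen.add(page)
            targets.append(page)
    return targets


if __name__ == "__main__":
    base = "https://forums.example.org/viewtopic.php"  # placeholder host
    links = [
        base + "?t=1",
        base + "?p=8833110#p8833110",
        base + "?p=8833113#p8833113",
        base + "?t=1#p8833113",
    ]
    print(crawl_targets(links))
    # ['https://forums.example.org/viewtopic.php?t=1']
```

With wget, a similar filter could probably be expressed through its `--reject-regex` option, though the exact pattern depends on the real URLs.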
  • All times are UTC
© 2001–2026 Gentoo Foundation, Inc.