Gentoo Forums
Trick to save resource files from visited websites
szatox
Advocate


Joined: 27 Aug 2013
Posts: 3138

Posted: Mon Feb 26, 2024 2:38 am    Post subject: Trick to save resource files from visited websites

I've been looking for a way to save files hidden behind JavaScript viewers. Here is what I eventually came up with.
The solution below is not complete, not perfect, and not even the most convenient thing in the world, but it works on the more obscure websites that don't have dedicated scrapers, which is way better than nothing at all.

How it works:
We connect to the websites via a proxy which snoops on our traffic and dumps responses on disk.

What we need:
mitmproxy
a custom script for saving files (mitmproxy comes with a bunch of provided scripts; I haven't found a suitable one)
a custom CA certificate

How to set things up:
Code:
# Setup as root:
emerge mitmproxy


# Setup as user
# Create CA cert and bundle it for use with mitmproxy
openssl req -x509 -sha256 -days 30 -newkey rsa:2048 -keyout mitm-ca.key -out mitm-ca.pem -nodes # you can make the cert valid for any time you want, I chose to limit potential damage
cat mitm-ca.key mitm-ca.pem > mitmproxy-ca.pem

# Create a plugin for mitmproxy
cat > save.py << 'eof'
import os
from pathlib import Path

def response(flow):
    location = "files/" + flow.request.host + flow.request.path
    # URLs ending in "/" are saved as index.htm inside that directory
    if location.endswith("/"):
        location += "index.htm"
    Path(os.path.dirname(location)).mkdir(parents=True, exist_ok=True)
    # content is None when mitmproxy has no body for this response
    with open(location, "wb") as f:
        f.write(flow.response.content or b"")
eof

# Start proxy as user
mitmdump --set confdir=. -s save.py
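
The script above uses the request path as-is, so query strings end up in file names and nothing stops a path containing ".." from pointing outside ./files. A minimal sketch of a stricter mapping follows; the names url_to_location and root are made up for illustration, and dropping the query string is an assumption of this sketch, not what the script above does.

```python
import posixpath


def url_to_location(host, path, root="files"):
    """Map a request host and path to a file location under root."""
    # Assumption: the query string is dropped, so URLs that differ only
    # in their query end up sharing one file.
    bare = path.split("?", 1)[0]
    # Collapse "." and ".." components; normalising an absolute path
    # never climbs above "/", so the result stays inside root/host.
    normalized = posixpath.normpath("/" + bare.lstrip("/"))
    # Directory-style URLs are saved as index.htm, as in the script above.
    if bare.endswith("/") or normalized == "/":
        normalized = posixpath.join(normalized, "index.htm")
    return posixpath.join(root, host, normalized.lstrip("/"))
```

Inside save.py this could replace the string concatenation, e.g. location = url_to_location(flow.request.host, flow.request.path).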


Add the CA certificate created above (mitm-ca.pem) to the browser's trusted root CA store.
Configure your browser to use the HTTP proxy on localhost:8080, use the same proxy for HTTPS, and enjoy all files being dumped under ./files/domain/path.
Note: It's probably a good idea to use a different browser for scraping the internet than for everyday use!
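
Before touching browser settings, the proxy and the certificate can be checked from a script. This is only a sketch using Python's standard library; build_proxy_opener is a made-up name, and the proxy address and mitm-ca.pem are taken from the steps above.

```python
import ssl
import urllib.request


def build_proxy_opener(proxy="http://localhost:8080", cafile=None):
    """Return an opener that sends HTTP and HTTPS requests via the proxy."""
    # With cafile set, only certificates signed by that CA are trusted;
    # with cafile=None the system's default CA store is used instead.
    context = ssl.create_default_context(cafile=cafile)
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}),
        urllib.request.HTTPSHandler(context=context),
    )


# Example use while mitmdump is running (example.org is a placeholder):
#   opener = build_proxy_opener(cafile="mitm-ca.pem")
#   with opener.open("https://example.org/", timeout=10) as reply:
#       print(reply.status, len(reply.read()))
```

If the request succeeds and a file shows up under ./files/, both the proxy and the CA certificate are probably set up correctly.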


Bonus: reassembling an HLS video from a million pieces:
These come as playlists referring to a great many very short video files; fortunately, we can fix that. First, prepare a list of input chunks for ffmpeg; each line should look like: file '/path/to/video/chunk'.
E.g. a playlist with relative paths can be converted like this:
sed -e "/^#/d; /^\$/d; s/.*/file '&'/" < playlist.m3u8 > list.txt

Then have ffmpeg merge them (add -safe 0 before -i if the list contains absolute paths):
ffmpeg -f concat -i list.txt -c copy output.mp4
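
The playlist conversion can also be done in Python, which makes it easier to cope with file names containing quotes. A rough sketch, assuming the playlist entries are relative paths to chunks that already sit next to list.txt; playlist_to_concat_list is a made-up name.

```python
def playlist_to_concat_list(playlist_text):
    """Turn the text of an .m3u8 playlist into ffmpeg concat-list text."""
    lines = []
    for raw in playlist_text.splitlines():
        entry = raw.strip()
        # Skip blank lines and playlist tags/comments, which start with '#'.
        if not entry or entry.startswith("#"):
            continue
        # Inside single quotes ffmpeg reads everything literally, so a quote
        # in the name is written as '\'' (close, escaped quote, reopen).
        escaped = entry.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "".join(line + "\n" for line in lines)
```

Write the returned text to list.txt and run the ffmpeg command as before.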


GLHF!
_________________
Make Computing Fun Again
Banana
Veteran


Joined: 21 May 2004
Posts: 1392
Location: Germany

Posted: Mon Feb 26, 2024 12:01 pm

So, is this something like Apache Traffic Server (https://docs.trafficserver.apache.org/index.html) or Squid (https://www.squid-cache.org/)?
_________________
My personal space
My delta-labs.org snippets do expire

PFL - Portage file list - find which package a file or command belongs to.
szatox
Advocate


Joined: 27 Aug 2013
Posts: 3138

Posted: Mon Feb 26, 2024 12:43 pm

Kinda similar, but not quite.
You _can_ script mitmproxy to act as a caching proxy (as in: returning the same content on subsequent requests), but my version is not that smart.
I don't think squid can intercept encrypted traffic though, which makes it effectively useless these days. Mitmproxy can, as long as your browser accepts forged certificates. I suppose you could chain the two, but that wasn't my goal.
Also, AFAIR squid used some kind of hashes for naming bits of data in its cache, so you'll have a hard time extracting them. I have mitmproxy store files under names which map directly to the source URLs. Much easier to use.
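
For the curious, the caching idea mentioned above could look roughly like the sketch below. It is not part of the setup from the first post: ReplayCache is a made-up name, the cache lives in memory only, it ignores cache-control headers, and it handles GET requests only. It relies on mitmproxy's request/response hooks and http.Response.make.

```python
class ReplayCache:
    """Remember GET responses by URL and replay them on later requests."""

    def __init__(self):
        self.bodies = {}  # URL -> (status code, body bytes, content type)

    def response(self, flow):
        # mitmproxy hook: runs once a complete response is available.
        if flow.request.method == "GET" and flow.response.content is not None:
            self.bodies[flow.request.url] = (
                flow.response.status_code,
                flow.response.content,
                flow.response.headers.get("content-type", ""),
            )

    def request(self, flow):
        # mitmproxy hook: setting flow.response here answers the request
        # from the cache instead of contacting the server.
        cached = self.bodies.get(flow.request.url)
        if flow.request.method == "GET" and cached is not None:
            from mitmproxy import http  # only needed on a cache hit
            status, body, content_type = cached
            flow.response = http.Response.make(
                status, body, {"Content-Type": content_type}
            )


addons = [ReplayCache()]
```

Saved as, say, cache.py, it would be loaded the same way as save.py: mitmdump --set confdir=. -s cache.py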