szatox Advocate
Joined: 27 Aug 2013 Posts: 3404
Posted: Mon Feb 26, 2024 2:38 am Post subject: Trick to save resource files from visited websites
I've been looking for some way to save files hidden behind javascript viewers. Here is what I eventually came up with.
The solution below is not complete, not perfect, and not even the most convenient thing in the world, but it can follow you to the more obscure websites which don't have dedicated scrapers, which is way better than nothing at all.
How it works:
We connect to the websites via a proxy which snoops on our traffic and dumps responses on disk.
What do we need:
mitmproxy
custom script for saving files (mitmproxy comes with a bunch of provided scripts, but I haven't found a suitable one)
custom CA certificate
How to set things up:
Code: | # Setup as root:
emerge mitmproxy
# Setup as user
# Create a CA cert and bundle it for use with mitmproxy
# (you can make the cert valid for any time you want; I chose 30 days to limit potential damage)
openssl req -x509 -sha256 -days 30 -newkey rsa:2048 -keyout mitm-ca.key -out mitm-ca.pem -nodes
cat mitm-ca.key mitm-ca.pem > mitmproxy-ca.pem
# Create a plugin for mitmproxy
cat > save.py << eof
import os
from pathlib import Path

def response(flow):
    # Map each URL to a path on disk: files/<host>/<path>
    location = "files/" + flow.request.host + flow.request.path
    Path(os.path.dirname(location)).mkdir(parents=True, exist_ok=True)
    # Directory-style URLs are saved as index.htm, everything else keeps its name
    if location.endswith("/"):
        location += "index.htm"
    with open(location, "wb") as f:
        f.write(flow.response.content)
eof
# Start proxy as user
mitmdump --set confdir=. -s save.py
|
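If you only care about the actual media files rather than every single response, a filtering variant of the plugin is easy enough. The file name save-media.py and the content-type list below are just my guesses at something sensible, adjust to taste:
Code: | # Only keep responses that look like media (playlists, video, audio);
# the content types listed here are an assumption, extend as needed
cat > save-media.py << eof
from pathlib import Path

WANTED = ("video/", "audio/", "application/vnd.apple.mpegurl", "application/octet-stream")

def response(flow):
    # Skip anything whose Content-Type doesn't look like media
    ctype = flow.response.headers.get("content-type", "")
    if not ctype.startswith(WANTED):
        return
    location = "files/" + flow.request.host + flow.request.path
    if location.endswith("/"):
        location += "index.htm"
    Path(location).parent.mkdir(parents=True, exist_ok=True)
    with open(location, "wb") as f:
        f.write(flow.response.content)
eof
mitmdump --set confdir=. -s save-media.py
|
Some sites serve chunks as application/octet-stream, hence that entry; drop it if it catches too much junk.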
Add the mitm CA certificate (mitm-ca.pem) created above to the browser's trusted root CA store.
Configure your browser to connect via http proxy on localhost:8080, use the same proxy for https, and enjoy all files being dumped under ./files/domain/path.
Note: it's probably a good idea to use a different browser for scraping the internet than for everyday use.
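For example, with Chromium on Linux (the throwaway profile directory is just a suggestion; Chromium reads the per-user NSS store in ~/.pki/nssdb, while Firefox keeps its own certificate store and proxy settings in its profile):
Code: | # Trust the CA in the NSS store Chromium uses (certutil comes with nss)
certutil -d sql:$HOME/.pki/nssdb -A -t "C,," -n mitm-ca -i mitm-ca.pem
# Start a throwaway profile pointed at the proxy
chromium --user-data-dir=/tmp/scrape-profile --proxy-server="http://localhost:8080"
|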
Bonus point: recovering an HLS video delivered in a million pieces.
These come as playlists referring to many, many very short video files; fortunately, we can stitch them back together. First, prepare a list of input chunks for ffmpeg; each line should look like: file '/path/to/video/chunk'.
E.g. a playlist with relative paths can be converted just like this:
sed -e '/^#/ d; s/^/file /' < playlist.m3u8 > list.txt
Then have ffmpeg merge it:
ffmpeg -f concat -i list.txt -c copy output.mp4
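If the playlist uses absolute URLs instead, and the chunks went through the proxy above so they landed under ./files/host/path, something like this should do the trick (the sed rewrite is my assumption about what your playlist looks like, so check list.txt before running ffmpeg):
Code: | # Rewrite absolute URLs into the local files/ tree
sed -e '/^#/ d; s|^https\?://|file files/|' < playlist.m3u8 > list.txt
# -safe 0 makes the concat demuxer accept paths with unusual characters (query strings etc.)
ffmpeg -f concat -safe 0 -i list.txt -c copy output.mp4
|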
GLHF!
_________________
Make Computing Fun Again
Banana Moderator
Joined: 21 May 2004 Posts: 1709 Location: Germany
szatox Advocate
Joined: 27 Aug 2013 Posts: 3404
Posted: Mon Feb 26, 2024 12:43 pm
Kinda similar but not quite.
You _can_ script mitmproxy to act as a caching proxy (as in: returning the same content upon subsequent requests), but my version is not that smart.
I don't think squid can intercept encrypted traffic though, which makes it effectively useless these days. Mitmproxy can, as long as your browser accepts forged certificates. I suppose you could chain those two, but that wasn't my goal.
Also, AFAIR squid used some kind of hashes for naming bits of data in its cache, so you'd have a hard time extracting anything from it. I have mitmproxy store files under names which map directly to the source URLs. Much easier to use.
_________________
Make Computing Fun Again