Boccaccio Apprentice
Joined: 19 Jul 2005 Posts: 286
Posted: Tue Jan 01, 2013 10:10 pm Post subject: Downloading files from the ÖNB
Dear all,
I am trying to automatically download digitized manuscripts from the Austrian National Library (ÖNB). The big problem is that they seem to name their jpeg files in a random manner, as if they wanted to prevent people from saving these public-domain files for offline use.
The starting point is a page like this one. Clicking on "Digitalisat" opens a new window which displays one of ~300 jpeg files, which I could then download manually. To automate this, I tried wget:
Code: |
wget -r --referer='http://aleph.onb.ac.at/F/7NH117C1MN83K28KUN2H7DR9Q5QCVSC8VRER4E7LQD1XED27KG-01111?func=full-set-set&set_number=009884&set_entry=000007&format=999' 'http://archiv.onb.ac.at:1801/webclient/DeliveryManager?pid=3050247&custom_att_2=simple_viewer'
|
which downloads a bunch of PHP and JavaScript files. Opening the PHP files yields something similar to the new window mentioned above, except that the jpeg file is missing. Does anybody have an idea how to get at the files?
Thanks in advance!
CrankyPenguin Apprentice
Joined: 19 Jun 2003 Posts: 283
Posted: Tue Jan 08, 2013 12:50 am Post subject:
You might have to scrape it, then. If they name the jpeg files randomly, some part of the site must generate the ordering (i.e. dynamic links). If you can pull and parse that, you might be able to write a nice python script that pulls the items serially: go to the first page, look for the 'next page' link (or something similar), pull that, and so on. _________________ Linux, the OS for the obsessive-compulsive speed freak in all of us.
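The loop suggested here can be sketched in plain shell as well. This is only a sketch: the pagination markup that `extract_next` looks for is an assumption about what the viewer's HTML might contain, and the driver loop at the bottom is left commented out because the real start URL would have to be filled in.

```shell
#!/bin/sh
# extract_next: read HTML on stdin and print the href of the first link
# whose text contains "next" (case-insensitive). The markup pattern is a
# guess at what the viewer's pagination links might look like.
extract_next() {
    grep -io 'href="[^"]*"[^>]*>[^<]*next' \
        | sed 's/.*href="\([^"]*\)".*/\1/' \
        | head -n 1
}

# Hypothetical driver: start at the first viewer page, list the jpeg
# URLs it references, then follow the "next" link until none is left.
#
# url='http://archiv.onb.ac.at:1801/...'   # first viewer page
# while [ -n "$url" ]; do
#     page=$(wget -q -O - "$url")
#     printf '%s\n' "$page" | grep -o 'src="[^"]*\.jpg"'
#     url=$(printf '%s\n' "$page" | extract_next)
# done
```

Whether this works at all depends on the "next" link being present in the served HTML rather than being generated by JavaScript, which is exactly what turned out to be the problem below.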
Boccaccio Apprentice
Joined: 19 Jul 2005 Posts: 286
Posted: Tue Jan 08, 2013 8:55 am Post subject:
In the meantime, I wrote an email and asked why there is no download available and why they name the files in a nonsystematic manner. The answer was that there is no download because they offer both public-domain and non-public-domain files (don't ask me why they can offer protected files on the net...). The naming scheme is not completely random, as I had supposed, but was necessary for their internal workflow.
Looking at the various javascripts on the website, I could not find anything that hints at how the next jpg is found and displayed. So as a temporary solution, I used xdotool to write a script that clicks to the next page every few seconds, so that at the end of the day I just have to collect the files from the browser cache.
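Such a clicker script can be very small. A minimal sketch, assuming the viewer's "next page" control sits under the mouse pointer and one click advances one page; the page count and the delay are guesses that have to be adapted to the manuscript and the connection speed:

```shell
#!/bin/sh
# step_pages N DELAY CMD...: run CMD, then wait DELAY seconds, N times.
# With CMD = "xdotool click 1" this left-clicks wherever the pointer
# currently is, i.e. on the viewer's "next page" button, once per page.
step_pages() {
    n=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$n" ]; do
        "$@"                 # e.g. xdotool click 1
        sleep "$delay"       # give the next image time to load
        i=$((i + 1))
    done
}

# e.g. ~300 pages, 5 s per page so each image lands in the cache:
# step_pages 300 5 xdotool click 1
```

Keeping the click command as an argument makes the loop easy to dry-run (substitute `echo` for `xdotool click 1`) before pointing it at the real viewer window.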
CrankyPenguin Apprentice
Joined: 19 Jun 2003 Posts: 283
Posted: Wed Jan 09, 2013 7:13 am Post subject:
Interesting solution. I find that screen scraping is harder now with javascript-heavy sites. How do you find the images in your cache? _________________ Linux, the OS for the obsessive-compulsive speed freak in all of us.
Boccaccio Apprentice
Joined: 19 Jul 2005 Posts: 286
Posted: Wed Jan 09, 2013 7:16 am Post subject:
I use Chrome, empty the browser cache before starting, and then have all of the jpegs nicely ordered in the cache directory.
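The collection step can be scripted too. A sketch, assuming the cache stores each entry as a separate file so that a jpeg can be recognized by its leading magic bytes FF D8; the cache path in the example is an assumption and varies per system and profile:

```shell
#!/bin/sh
# collect_jpegs CACHE_DIR OUT_DIR: copy every file in CACHE_DIR that
# starts with the JPEG magic bytes (FF D8) into OUT_DIR as
# page_0001.jpg, page_0002.jpg, ... and print how many were copied.
collect_jpegs() {
    cache=$1; out=$2
    mkdir -p "$out"
    n=0
    for f in "$cache"/*; do
        [ -f "$f" ] || continue
        # first two bytes of every JPEG file are 0xFF 0xD8
        if [ "$(od -An -tx1 -N2 "$f" | tr -d ' ')" = "ffd8" ]; then
            n=$((n + 1))
            cp "$f" "$out/$(printf 'page_%04d.jpg' "$n")"
        fi
    done
    echo "$n"
}

# e.g. (path is a guess, adjust to your profile):
# collect_jpegs ~/.cache/google-chrome/Default/Cache ./manuscript
```

Note that the glob iterates in lexical order, which only matches page order if the cache file names happen to sort that way; sorting the candidates by modification time instead may be safer.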