Gentoo Forums
Downloading files from the ÖNB
Boccaccio
Apprentice


Joined: 19 Jul 2005
Posts: 286

Posted: Tue Jan 01, 2013 10:10 pm    Post subject: Downloading files from the ÖNB

Dear all,

I am trying to automatically download digitized manuscripts from the Austrian National Library (ÖNB). The big problem is that they seem to name their JPEG files randomly, as if they wanted to prevent people from getting these public domain files for offline use.

The starting point is a page like this one. Clicking on "Digitalisat" opens a new window which displays one of ~300 JPEG files, which I could then download manually. To automate this, I tried wget:

Code:

wget -r  --referer='http://aleph.onb.ac.at/F/7NH117C1MN83K28KUN2H7DR9Q5QCVSC8VRER4E7LQD1XED27KG-01111?func=full-set-set&set_number=009884&set_entry=000007&format=999' 'http://archiv.onb.ac.at:1801/webclient/DeliveryManager?pid=3050247&custom_att_2=simple_viewer'


which downloads a bunch of PHP and JavaScript files. Opening the PHP files yields something similar to the new window mentioned above, except that the JPEG file is missing. Does anybody have an idea how to get the files?

Thanks in advance!
CrankyPenguin
Apprentice


Joined: 19 Jun 2003
Posts: 283

Posted: Tue Jan 08, 2013 12:50 am

You might have to scrape it, then. If they name the JPEG files randomly, some part of the site must still generate the ordering (i.e. dynamic links). If you can pull and parse that, you might be able to write a Python script that fetches the items serially: go to the first page, look for the 'next page' link (or something similar), fetch that, and so on.
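A minimal sketch of that idea, using only the Python standard library. It assumes the viewer pages are plain server-rendered HTML with the image in an <img> tag and a link whose text contains "next page"; none of that is confirmed for the ÖNB site, and the names PageParser, fetch and crawl are made up for illustration. If the page is assembled by JavaScript, this will find nothing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects <img> sources and the href of the first matching 'next' link."""

    def __init__(self, next_text="next page"):
        super().__init__()
        self.next_text = next_text.lower()  # assumed link label, adjust per site
        self.images = []
        self.next_href = None
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text seen inside that <a> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip().lower()
            if self.next_href is None and self.next_text in label:
                self.next_href = self._href
            self._href = None


def fetch(url):
    """Download a page and return it as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch, max_pages=500):
    """Return absolute image URLs in page order by following 'next' links."""
    seen, images, url = set(), [], start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = PageParser()
        parser.feed(fetch(url))
        images.extend(urljoin(url, src) for src in parser.images)
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return images
```

The seen set and max_pages guard against a viewer whose last page links back to the first. The returned URLs could then be handed to wget or urlretrieve, with the referer set if the server checks it.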
_________________
Linux, the OS for the obsessive-compulsive speed freak in all of us.
Boccaccio
Apprentice


Joined: 19 Jul 2005
Posts: 286

Posted: Tue Jan 08, 2013 8:55 am

In the meantime, I wrote them an email asking why there is no download available and why they name the files in a nonsystematic manner. In reply, I was told that there is no download because they offer both public domain and non-public domain files (don't ask me why they can offer protected files on the net...). The naming scheme is not completely random, as I had supposed, but is required by their internal working procedure.

Looking at the various JavaScript files on the website, I could not find any hint about how the next JPEG is found and displayed. So as a temporary solution, I used xdotool to write a script that clicks through to the next page every few seconds, so that at the end of the day I just have to collect the files from the browser cache.
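For anyone wanting to try the same thing, a rough sketch of such a click loop follows. It is not the script I actually used: it is written in Python around the xdotool command line (the mousemove and click subcommands), and the function name click_through, the coordinates and the delay are all placeholders to adapt.

```python
import subprocess
import time


def click_through(pages, delay=5.0, x=None, y=None,
                  run=subprocess.run, sleep=time.sleep):
    """Click the viewer's 'next' button `pages` times, pausing between clicks.

    x and y are hypothetical screen coordinates of the 'next' button; if they
    are omitted, the click happens wherever the pointer currently is.
    """
    for _ in range(pages):
        if x is not None and y is not None:
            # Put the pointer back on the button in case it was moved.
            run(["xdotool", "mousemove", str(x), str(y)], check=True)
        # Button 1 is the left mouse button.
        run(["xdotool", "click", "1"], check=True)
        # Give the browser time to load (and cache) the next image.
        sleep(delay)
```

Calling click_through(300, delay=5.0) with the pointer parked on the button would step through roughly 300 pages. The delay has to be long enough for each image to finish loading, otherwise it never reaches the cache.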
CrankyPenguin
Apprentice


Joined: 19 Jun 2003
Posts: 283

Posted: Wed Jan 09, 2013 7:13 am

Interesting solution. I find that screen scraping is harder now with JavaScript-heavy sites. How do you find the images in your cache?
_________________
Linux, the OS for the obsessive-compulsive speed freak in all of us.
Boccaccio
Apprentice


Joined: 19 Jul 2005
Posts: 286

Posted: Wed Jan 09, 2013 7:16 am

I use Chrome. I empty the browser cache before starting, and afterwards all of the JPEGs are nicely ordered in the cache directory.
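If you want to pull them out with a script instead of by hand, something along these lines might work. It is only a sketch under assumptions: that each cached image sits in its own file, that the file starts with the raw JPEG signature bytes, and that modification time reflects viewing order. The cache location and on-disk layout differ between Chrome versions (some prepend their own header to each entry, in which case this check finds nothing), so the directory has to be supplied by hand. collect_jpegs and the page_NNNN.jpg naming are made up here.

```python
import shutil
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8\xff"  # first three bytes of every JPEG file


def collect_jpegs(cache_dir, out_dir):
    """Copy every cache file that starts with the JPEG signature, oldest first."""
    cache_dir, out_dir = Path(cache_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    jpegs = []
    for path in cache_dir.iterdir():
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            if fh.read(3) == JPEG_MAGIC:
                jpegs.append(path)

    # Assumption: modification time approximates the order the pages were
    # viewed in; the file name is only a tie-breaker.
    jpegs.sort(key=lambda p: (p.stat().st_mtime_ns, p.name))

    copied = []
    for number, path in enumerate(jpegs, start=1):
        target = out_dir / f"page_{number:04d}.jpg"
        shutil.copyfile(path, target)
        copied.append(target)
    return copied
```

Emptying the cache before starting, as described above, is what keeps unrelated images out of the result.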