basic python mechanize + beautifulsoup question

methodtwo · Apprentice Joined: 01 Feb 2008 Posts: 231

Hi there
I have some web scraping code that uses Python mechanize and BeautifulSoup. I need to feed the text (HTML) of a web page retrieved by mechanize to BeautifulSoup. Whenever I copy and paste the HTML from "page source" in Firefox, the code works. But whenever I do:

Hu · Moderator Joined: 06 Mar 2007 Posts: 21607

What is the output of diff -u comparing the file written by Mechanize with the file written by Firefox?

methodtwo · Apprentice Joined: 01 Feb 2008 Posts: 231

I've run:

Hu · Moderator Joined: 06 Mar 2007 Posts: 21607

Without seeing at least a sample of the differences, it is hard to speculate. You might have differing line endings or other whitespace noise, which you can ask diff to suppress. You might be getting different pages depending on the headers sent, which would cause differences in the text. Your techniques vary in what is passed to BeautifulSoup, so some might work where others fail, even after you sort out the HTML difference.
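If the diff output is still too noisy to read, a rough Python equivalent of diff -u -b can help narrow it down. This is only a sketch: the function names and the file labels firefox.html and mechanize.html are made up for illustration, and the whitespace normalization only approximates what diff -b does.

```python
import difflib
import re


def normalized_lines(html):
    """Split text into lines, collapsing whitespace runs and trimming line ends.

    This approximates `diff -b`; it also hides CRLF versus LF differences.
    """
    return [re.sub(r"\s+", " ", line).rstrip() for line in html.splitlines()]


def whitespace_insensitive_diff(saved, fetched,
                                saved_name="firefox.html",      # hypothetical label
                                fetched_name="mechanize.html"):  # hypothetical label
    """Return unified-diff lines for two HTML strings, ignoring whitespace noise."""
    return list(difflib.unified_diff(
        normalized_lines(saved),
        normalized_lines(fetched),
        fromfile=saved_name,
        tofile=fetched_name,
        lineterm="",
    ))
```

An empty result means the two pages differ only in whitespace or line endings. Anything left over is a real content difference and is the part worth posting.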

methodtwo · Apprentice Joined: 01 Feb 2008 Posts: 231

Thanks for putting me on the right track. I'll send a sample of the diff output when I can figure out what is likely to be relevant. I got considerably less output when using the -b option. :oops:

methodtwo · Apprentice Joined: 01 Feb 2008 Posts: 231

I've tried all the various ways of making the soup, and I don't understand how diff -u -b output would help if the difference is caused by not being able to get the HTML in the right layout. What other way could I pass the HTML to BeautifulSoup so that the rest of the app would work as though I had just copied and pasted the HTML into a file?
I just need to adjust the BeautifulSoup code that processes the HTML content to account for the fact that it's different when not copying and pasting! Of course, why didn't I realise that before? Goddamnit, I'm slow on the uptake! I was getting distracted by the fact that there was a difference rather than just thinking "OK, I'll account for the difference".
Also, when do you need to do something like:

Hu · Moderator Joined: 06 Mar 2007 Posts: 21607

Since we have not yet seen the output, it is not clear to me whether the manually saved file and the mechanized file had contents that were even somewhat similar. The first order of business was to confirm that the mechanized file was well-formed HTML. Your post rambles a bit, but I think you solved that part somehow.

With regard to your question, you do not decode the utf-8. You are decoding an incoming byte sequence as though it were utf-8 in order to get a Python string. You should only do this if you know the server is using utf-8 encoding for its pages. If it is using some other encoding, you may get a Unicode decoding error.
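To illustrate, here is a minimal sketch of turning the raw bytes into a string. The function name decode_page and its declared_charset parameter are hypothetical. It assumes you pass in whatever charset the server declares (for example in its Content-Type header), falls back to utf-8 when none is given, and substitutes replacement characters instead of raising when the bytes do not match.

```python
def decode_page(raw, declared_charset=None):
    """Decode response bytes into a str.

    declared_charset is whatever the server claims to use; utf-8 is only
    an assumed default for when nothing is declared.
    """
    encoding = declared_charset or "utf-8"
    try:
        return raw.decode(encoding)
    except LookupError:
        # The declared name is not a codec Python knows; fall back to utf-8
        # and mark undecodable bytes with U+FFFD.
        return raw.decode("utf-8", errors="replace")
    except UnicodeDecodeError:
        # The bytes are not valid in the chosen encoding; keep going but
        # mark the bad bytes with U+FFFD so the damage is visible.
        return raw.decode(encoding, errors="replace")
```

Replacement characters in the result are a strong hint that the page is not in the encoding you assumed.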

methodtwo · Apprentice Joined: 01 Feb 2008 Posts: 231

OK, so basically I'm supposed to log in to the site and determine which slots in a table are available to be booked. The slots that are "available" have buttons in them (implemented as CSS classes(?)) so you can proceed to booking that time slot. When I copy and paste the HTML from Firefox "page source" I can run this code:

Hu · Moderator Joined: 06 Mar 2007 Posts: 21607

That sounds to me like there is something wrong with your mechanized request. The site may be varying its output based on request headers, so you should check that Mechanize sends exactly the same request options as Firefox. You may even need to match the user-agent.
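As a rough sketch of what matching the headers could look like: mechanize's Browser keeps its default request headers in the addheaders attribute, a list of (name, value) pairs. The header values and the helper names apply_headers and fetch_html below are placeholders for illustration; copy the real values from what Firefox actually sends to your site.

```python
# Hypothetical values; replace them with the headers your Firefox sends.
FIREFOX_LIKE_HEADERS = [
    ("User-Agent",
     "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"),
    ("Accept",
     "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.5"),
]


def apply_headers(browser, headers):
    """Replace the default headers on a mechanize-style browser object."""
    browser.addheaders = list(headers)  # copy so the shared list is not mutated
    return browser


def fetch_html(url, headers=FIREFOX_LIKE_HEADERS):
    """Fetch a page with browser-like headers and return the raw bytes."""
    import mechanize  # third-party; imported here so the helper above works without it

    browser = apply_headers(mechanize.Browser(), headers)
    return browser.open(url).read()
```

If the page still differs after this, the remaining suspects are cookies or the login state rather than the headers.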