Gentoo Forums
basic python mechanize + beautifulsoup question
methodtwo
Apprentice


Joined: 01 Feb 2008
Posts: 231

Posted: Sun Mar 23, 2014 4:22 pm    Post subject: basic python mechanize + beautifulsoup question

Hi there
I have some web scraping code that uses Python mechanize and BeautifulSoup. I need to feed the text (HTML) of a web page retrieved by mechanize to BeautifulSoup. Whenever I copy and paste the HTML from "page source" in Firefox, the code works. But whenever I do:
Code:

html = self.br.open(site_url + 'page.aspx').read()
with open('my_htmlfile.txt', 'w') as f:  # 'with' closes the file, so the data is flushed
    f.write(html)
with open('my_htmlfile.txt', 'r') as my_html:
    soup = BeautifulSoup(my_html)

Or:
Code:
with open('./script.html', 'w') as myfile:  # 'with' closes (and flushes) the file
    myfile.write(response.read())

Or:
Code:
soup = BeautifulSoup(response.get_data())

Then the code doesn't work, even though it does work when I copy and paste from "page source" in Firefox. I know you probably don't want to debug my whole thing for me. I was just asking in case there was anything obvious I was missing in terms of what I'm feeding to BeautifulSoup when I do it programmatically.
Thank you for reading and for any replies I might get.
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21607

Posted: Sun Mar 23, 2014 5:58 pm

What is the output of diff -u comparing the file written by Mechanize with the file written by Firefox?
methodtwo
Apprentice


Joined: 01 Feb 2008
Posts: 231

Posted: Sun Mar 23, 2014 8:52 pm

I've run:
Code:
diff -u my_htmlfile.txt real_htmlfile.html

and there's a massive difference. Some JavaScript shows up as normal text in one of the versions, for example. The output is far too large to paste into a forum post, and I've never analysed diffs much before. Thank you very much for pointing me in the right direction. I'm sorry, but I didn't know there would be such a wild divergence between the two files when I first asked. There are literally hundreds upon hundreds of lines preceded by a + or a -.
What would your guess be as to why a copy-and-paste from Firefox would generate such different HTML from one fetched by Python mechanize and written to disk by Python? Also, of the methods I used in the original post, which one looked like it should definitely have worked? I guess I can try diff -u with all the techniques I tried.
Thanks for your reply. I'm sure I'll get it eventually now. Sorry again that there's far too much output to paste into this post.
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21607

Posted: Sun Mar 23, 2014 9:57 pm

Without seeing at least a sample of the differences, it is hard to speculate. You might have differing line endings or other whitespace noise, which you can ask diff to suppress. You might be getting different pages depending on the headers sent, which would cause differences in the text. Your techniques vary in what is passed to BeautifulSoup, so some might work where others fail, even after you sort out the HTML difference.
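
As an illustration of suppressing whitespace noise before comparing, here is a minimal Python sketch that normalises both files and then produces a unified diff. It is similar in spirit to diff -u -b but not identical to it (it also drops blank lines). The function names and the sample strings are hypothetical and not taken from the site in question.

```python
import difflib


def normalize(html_text):
    """Collapse runs of whitespace on each line and drop blank lines."""
    lines = []
    for line in html_text.splitlines():  # splitlines() handles \n and \r\n
        collapsed = " ".join(line.split())
        if collapsed:
            lines.append(collapsed)
    return lines


def whitespace_insensitive_diff(a_text, b_text):
    """Return unified-diff lines that ignore whitespace-only changes."""
    return list(difflib.unified_diff(
        normalize(a_text), normalize(b_text),
        fromfile="mechanize", tofile="firefox", lineterm=""))


# Hypothetical sample data standing in for the two saved files.
fetched = "<html>\r\n  <body>\r\n<p>Slot   A</p>\r\n</body></html>\r\n"
saved = "<html>\n<body>\n\n<p>Slot A</p>\n</body></html>\n"
changed = saved.replace("Slot A", "Slot B")

same = whitespace_insensitive_diff(fetched, saved)         # whitespace-only changes
different = whitespace_insensitive_diff(fetched, changed)  # a real text change
```

If the normalised diff is still huge, the two files probably differ in content rather than in layout, which points towards the server sending a different page.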
methodtwo
Apprentice


Joined: 01 Feb 2008
Posts: 231

Posted: Mon Mar 24, 2014 12:03 am

Thanks for putting me on the right track. I'll send a sample of the diff output when I can figure out what is likely to be relevant. I got considerably less output when using the -b option. :oops:
methodtwo
Apprentice


Joined: 01 Feb 2008
Posts: 231

Posted: Mon Mar 24, 2014 6:53 pm

I've tried all the various ways of making the soup, and I don't understand how diff -u -b output would help if the difference is caused by not being able to get the HTML in the right layout. What other way could I pass the HTML to BeautifulSoup so that the rest of the app would work as though I had just copied and pasted the HTML into a file?
I just need to adjust the BeautifulSoup code that processes the HTML content to account for the fact that it's different when not copy-and-pasting! Of course, why didn't I realise that before? Goddamnit, I'm slow on the uptake! I was getting distracted by the fact that there was a difference rather than just thinking "OK, I'll account for the difference".
Also, when do you need to do something like:
Code:
soup = BeautifulSoup(self.response.read().decode('utf-8'))

to decode the utf-8? I.e. why do you need to decode utf-8?
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21607

Posted: Tue Mar 25, 2014 2:03 am

Since we have not yet seen the output, it is not clear to me whether the manually saved file versus the mechanized file had contents that were even somewhat similar. The first order of business was to confirm that the mechanized file was well-formed HTML. Your post rambles a bit, but I think you solved that part somehow.

With regard to your question, you do not decode the utf-8. You are decoding an incoming byte sequence as though it were utf-8 in order to get a Python string. You should only do this if you know the server is using utf-8 encoding for its pages. If it is using some other encoding, you may get a Unicode decoding error.
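
To make the point about decoding concrete, here is a small sketch. The sample bytes and the helper charset_from_content_type are made up for illustration. One place a server commonly declares its encoding is the charset parameter of the Content-Type response header, which is the assumption this helper relies on.

```python
def charset_from_content_type(value, default="utf-8"):
    """Pick the charset parameter out of a Content-Type header value.

    Hypothetical helper: falls back to `default` when none is declared.
    """
    for part in value.split(";")[1:]:
        name, _, val = part.strip().partition("=")
        if name.lower() == "charset" and val:
            return val.strip("\"'")
    return default


raw = u"caf\u00e9".encode("utf-8")   # bytes as they might arrive from a server

text = raw.decode("utf-8")           # right codec: the original text comes back
wrong = raw.decode("latin-1")        # wrong codec: no error, but garbled text

latin1_bytes = u"caf\u00e9".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")     # these bytes are not valid UTF-8
    failed = False
except UnicodeDecodeError:
    failed = True

declared = charset_from_content_type("text/html; charset=ISO-8859-1")
```

Decoding with the wrong codec does not always raise an error. Sometimes it silently produces the wrong characters, so it is worth checking what the server actually declares.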
methodtwo
Apprentice


Joined: 01 Feb 2008
Posts: 231

Posted: Tue Mar 25, 2014 10:32 pm

OK, so basically I'm supposed to log in to the site and determine which slots in a table are available to be booked. The slots that are "available" have buttons in them (implemented as CSS classes(?)) so you can proceed to booking that time slot. When I copy and paste the HTML from Firefox "page source" I can run this code:
Code:
# In the calling method:
self.buttons = soup.findAll(attrs={'class': 'buttonwrapper'})
self.coords = self.get_coords(self.buttons)

def get_coords(self, buttons):
    # Walk the button wrappers in document order and record the
    # running position of every slot whose button says "Available".
    for x in buttons:
        soup3 = BeautifulSoup(str(x))
        partial = soup3.find(attrs={'value': "Partial"})
        if partial is not None:
            print(partial)
            self.par += 1
            continue
        available = soup3.find(attrs={'value': "Available"})
        # find() returns None when nothing matches, so only count real hits
        if available is not None:
            print(available)
            self.avails += 1
            self.par_temp = self.par + self.avails
            self.avail.append(self.par_temp)
    return self.avail

And the output indicates that the found "buttonwrapper" classes are in the same order as indicated by looking at the HTML on the real web site. However, when I grab the HTML programmatically with:
Code:
soup = BeautifulSoup(self.response.get_data())

Then the output of the above code would be something like:
Code:

<input class="removeUnderLineAvailable" disabled="disabled" id="ctl00_MainContent_cal_calbtn0" name="ctl00$MainContent$cal$calbtn0" style="width:100%;" type="submit" value="Partial"/>
<input class="removeUnderLineAvailable" disabled="disabled" id="ctl00_MainContent_cal_calbtn1" name="ctl00$MainContent$cal$calbtn1" style="width:100%;" type="submit" value="Partial"/>
<input class="removeUnderLineAvailable" disabled="disabled" id="ctl00_MainContent_cal_calbtn2" name="ctl00$MainContent$cal$calbtn2" style="width:100%;" type="submit" value="Partial"/>
<input class="removeUnderLineAvailable" disabled="disabled" id="ctl00_MainContent_cal_calbtn3" name="ctl00$MainContent$cal$calbtn3" style="width:100%;" type="submit" value="Partial"/>
<input class="removeUnderLineAvailable" disabled="disabled" id="ctl00_MainContent_cal_calbtn4" name="ctl00$MainContent$cal$calbtn4" style="width:100%;" type="submit" value="Partial"/>
<input class="removeUnderLineAvailable" id="ctl00_MainContent_cal_calbtn5" name="ctl00$MainContent$cal$calbtn5" style="width:100%;" type="submit" value="Available"/>

The key thing to note is that the first slot with a value of "Available" should be the 5th one, for example. The order is different when the HTML is fetched programmatically with Python mechanize and fed to BeautifulSoup. So the order of the style classes is different and thus the code is broken. It needs to work out which slots are available, but it always gets it wrong because the order differs when the HTML is fetched with mechanize (as opposed to copy-and-pasting from Firefox). The diff output just shows the whole chunk of button and time classes as different. Thus I really don't know what to do.
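
For reference, a minimal sketch of listing the submit buttons in document order and picking out the enabled "Available" ones. It uses Python 3's standard html.parser rather than BeautifulSoup so that it runs without extra packages. The sample markup is abbreviated from the output quoted above, and the class name SubmitButtonCollector is made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 stdlib, swapped in for BeautifulSoup


class SubmitButtonCollector(HTMLParser):
    """Collect (id, value, disabled) for each submit <input>, in document order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.buttons = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "submit":
            self.buttons.append((attributes.get("id"),
                                 attributes.get("value"),
                                 "disabled" in attributes))


# Abbreviated from the output quoted above (name and style attributes dropped).
SAMPLE = """
<input disabled="disabled" id="ctl00_MainContent_cal_calbtn0" type="submit" value="Partial"/>
<input disabled="disabled" id="ctl00_MainContent_cal_calbtn1" type="submit" value="Partial"/>
<input disabled="disabled" id="ctl00_MainContent_cal_calbtn2" type="submit" value="Partial"/>
<input disabled="disabled" id="ctl00_MainContent_cal_calbtn3" type="submit" value="Partial"/>
<input disabled="disabled" id="ctl00_MainContent_cal_calbtn4" type="submit" value="Partial"/>
<input id="ctl00_MainContent_cal_calbtn5" type="submit" value="Available"/>
"""

collector = SubmitButtonCollector()
collector.feed(SAMPLE)

# Zero-based positions of buttons that are "Available" and not disabled.
available_positions = [
    index
    for index, (_, value, disabled) in enumerate(collector.buttons)
    if value == "Available" and not disabled
]
```

The positions here are zero-based, like the calbtn0 to calbtn5 numbering in the ids, so counting from 0 versus counting from 1 is worth keeping in mind when comparing the result against the page.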
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21607

Posted: Wed Mar 26, 2014 1:56 am

That sounds to me like there is something wrong with your mechanized request. The site may be varying its output based on request headers, so you should check that Mechanize sends exactly the same request options as Firefox. You may even need to match the user-agent.
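
As a rough sketch of what matching the request headers can look like, the following builds (but does not send) a request carrying browser-like headers. It uses Python 3's standard urllib.request instead of mechanize so that it runs without extra packages. The header values and the URL are placeholders; the real ones would need to be copied from what Firefox actually sends. With mechanize, as far as I know, the usual counterpart is assigning a list of (name, value) tuples to the browser object's addheaders attribute.

```python
from urllib.request import Request  # Python 3 stdlib, used here instead of mechanize

# Hypothetical values: replace them with the headers the browser really sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Build (but do not send) a request carrying browser-like headers."""
    return Request(url, headers=headers)


request = build_request("http://example.com/page.aspx")  # placeholder URL

# urllib stores header names capitalised as "User-agent", "Accept-language", etc.
user_agent = request.get_header("User-agent")
```

Cookies and hidden form fields from the login step can matter as much as the headers, so it may help to compare the whole request and not only the user-agent.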