methodtwo Apprentice
Joined: 01 Feb 2008 Posts: 231
|
Posted: Sun Mar 23, 2014 4:22 pm Post subject: basic python mechanize + beautifulsoup question |
|
|
Hi there
I have some web scraping code that uses Python mechanize and BeautifulSoup. I need to feed the text (HTML) of a web page retrieved by mechanize to BeautifulSoup. Whenever I copy and paste the HTML from "page source" in Firefox, the code works. But whenever I do:
Code: |
file("my_htmlfile.txt","w").write(self.br.open(site_url+'page.aspx').read())
my_html = open('./my_htmlfile.txt', 'r')
soup = BeautifulSoup(my_html)
|
Or:
Code: | myfile = open('./script.html','w')
myfile.write(response.read()) |
Or:
Code: | soup = BeautifulSoup(response.get_data()) |
Then the code doesn't work, even though it does work when I copy and paste from "page source" in Firefox. I know you probably don't want to debug my whole thing for me. I was just asking in case there was anything obvious I was missing in terms of what I'm feeding to BeautifulSoup when I do it programmatically.
Thank you for reading and for any replies I might get. |
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21607
|
Posted: Sun Mar 23, 2014 5:58 pm Post subject: |
|
|
What is the output of diff -u comparing the file written by Mechanize with the file written by Firefox? |
|
methodtwo Apprentice
Joined: 01 Feb 2008 Posts: 231
|
Posted: Sun Mar 23, 2014 8:52 pm Post subject: |
|
|
I've run:
Code: | diff -u my_htmlfile.txt real_htmlfile.html |
and there's a massive difference. Some JavaScript shows up as normal text in one of the versions, for example. The output is far too large to paste into a forum post, and I've never analysed diffs much before. Thank you very much for pointing me in the right direction. I'm sorry, but I didn't know there would be such a wild divergence between the two files when I first asked. There are literally hundreds upon hundreds of lines preceded by a + or a -.
What would your guess be as to why a copy-and-paste from Firefox would produce such different HTML from one fetched by Python mechanize and written to disk by Python? Also, of the methods I used in the original post, which one looked like it definitely should have worked? I guess I can try diff -u with all the techniques I tried.
Thanks for your reply. I'm sure I'll get it eventually now. Sorry again that there's far too much output to paste into this post. |
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21607
|
Posted: Sun Mar 23, 2014 9:57 pm Post subject: |
|
|
Without seeing at least a sample of the differences, it is hard to speculate. You might have differing line endings or other whitespace noise, which you can ask diff to suppress. You might be getting different pages depending on the headers sent, which would cause differences in the text. Your techniques vary in what is passed to BeautifulSoup, so some might work where others fail, even after you sort out the HTML difference. |
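To illustrate the point about whitespace noise: a minimal sketch of a whitespace-insensitive comparison using Python's standard difflib module, roughly in the spirit of diff -u -b. The function names and the "mechanize" / "firefox" labels are made up for this example, and the normalization is slightly more aggressive than diff -b because it also drops leading whitespace.

```python
import difflib


def normalize(line):
    # Collapse runs of whitespace and strip both ends, so that line endings
    # (\r\n vs \n) and indentation changes do not show up as differences.
    return " ".join(line.split())


def whitespace_insensitive_diff(text_a, text_b,
                                name_a="mechanize", name_b="firefox"):
    # Compare two HTML documents line by line after normalizing whitespace.
    lines_a = [normalize(line) for line in text_a.splitlines()]
    lines_b = [normalize(line) for line in text_b.splitlines()]
    return list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile=name_a, tofile=name_b,
                                     lineterm=""))
```

If the returned list is empty, the two files differ only in whitespace. If it is still hundreds of lines long, the server is probably sending genuinely different markup to the two clients.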
|
methodtwo Apprentice
Joined: 01 Feb 2008 Posts: 231
|
Posted: Mon Mar 24, 2014 12:03 am Post subject: |
|
|
Thanks for putting me on the right track. I'll send a sample of the diff output when I can figure out what is likely to be relevant. I got considerably less output when using the -b option. |
|
methodtwo Apprentice
Joined: 01 Feb 2008 Posts: 231
|
Posted: Mon Mar 24, 2014 6:53 pm Post subject: |
|
|
I've tried all the various ways of making the soup, and I don't understand how diff -u -b output would help if the difference is caused by not being able to get the HTML in the right layout. What other way could I pass the HTML to BeautifulSoup so that the rest of the app would work as though I had just copied and pasted the HTML into a file?
I just need to adjust the BeautifulSoup code that processes the HTML content to account for the fact that it's different when not copying and pasting! Of course, why didn't I realise that before? Goddamnit, I'm slow on the uptake! I was getting distracted by the fact that there was a difference rather than just thinking "OK, I'll account for the difference".
Also, when do you need to do something like:
Code: | soup = BeautifulSoup(self.response.read().decode('utf-8')) |
to decode the UTF-8? That is, why do you need to decode UTF-8 at all? |
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21607
|
Posted: Tue Mar 25, 2014 2:03 am Post subject: |
|
|
Since we have not yet seen the output, it is not clear to me whether the manually saved file versus the mechanized file had contents that were even somewhat similar. The first order of business was to confirm that the mechanized file was well formed HTML. Your post rambles a bit, but I think you solved that part somehow.
With regard to your question, you do not decode the utf-8. You are decoding an incoming byte sequence as though it were utf-8 in order to get a Python string. You should only do this if you know the server is using utf-8 encoding for its pages. If it is using some other encoding, you may get a Unicode decoding error. |
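As a rough illustration of that point, here is a sketch that picks the codec from the server's Content-Type header instead of hard-coding UTF-8. It uses only the Python 3 standard library. The helper names are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption made for the example, not a rule.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter,
    # or the fallback when the header carries none.
    return msg.get_content_charset(default)


def decode_body(raw_bytes, content_type=None, default="utf-8"):
    """Turn response bytes into text using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(content_type, default))
```

For instance, the bytes b"caf\xe9" decode cleanly when the header says charset=ISO-8859-1, but raise UnicodeDecodeError if they are treated as UTF-8. That is the failure mode described above.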
|
methodtwo Apprentice
Joined: 01 Feb 2008 Posts: 231
|
Posted: Tue Mar 25, 2014 10:32 pm Post subject: |
|
|
OK, so basically I'm supposed to log in to the site and determine which slots in a table are available to be booked. The slots that are "available" have buttons in them (implemented as CSS classes(?)) so you can proceed to booking that time slot. When I copy and paste the HTML from Firefox "page source" I can run this code:
Code: |
self.buttons = soup.findAll(attrs={'class': 'buttonwrapper'})
self.coords = self.get_coords(self.buttons)

def get_coords(self, buttons):
    for x in buttons:
        soup3 = BeautifulSoup(str(x))
        try:
            partial = soup3.find(attrs={'value': "Partial"})
            if partial is not None:
                print partial
                self.par += 1
                #partial = None
        except:
            pass#print 'My bad'
        if partial == None:
            try:
                available = soup3.find(attrs={'value': "Available"})
                print available
                self.avails += 1
                self.par_temp = self.par + self.avails
                self.avail.append(self.par_temp)
            except:
                pass #print 'Can\'t get either div class'
    return self.avail
|
And the output indicates that the found "buttonwrapper" classes are in the same order as indicated by looking at the HTML on the real web site. However, when I grab the HTML programmatically with:
Code: | soup = BeautifulSoup(self.response.get_data())
|
Then the output of the above code would be something like:
Code: |
<input class="removeUnderLineAvailable" disabled="disabled" id="ctl00_MainContent_cal_calbtn0" name="ctl00$MainContent$cal$calbtn0" style="width:100%;" type="submit" value="Partial"/>
<input class="removeUnderLineAvailable" disabled="disabled" id="ctl00_MainContent_cal_calbtn1" name="ctl00$MainContent$cal$calbtn1" style="width:100%;" type="submit" value="Partial"/>
<input class="removeUnderLineAvailable" disabled="disabled" id="ctl00_MainContent_cal_calbtn2" name="ctl00$MainContent$cal$calbtn2" style="width:100%;" type="submit" value="Partial"/>
<input class="removeUnderLineAvailable" disabled="disabled" id="ctl00_MainContent_cal_calbtn3" name="ctl00$MainContent$cal$calbtn3" style="width:100%;" type="submit" value="Partial"/>
<input class="removeUnderLineAvailable" disabled="disabled" id="ctl00_MainContent_cal_calbtn4" name="ctl00$MainContent$cal$calbtn4" style="width:100%;" type="submit" value="Partial"/>
<input class="removeUnderLineAvailable" id="ctl00_MainContent_cal_calbtn5" name="ctl00$MainContent$cal$calbtn5" style="width:100%;" type="submit" value="Available"/> |
The key thing to note is that the first slot with a value of "Available" should be the 5th one, for example. The order is different when the HTML is fetched programmatically with Python mechanize and fed to BeautifulSoup. So the order of the style classes is different, and thus the code is broken. It needs to work out which slots are available, but it always gets it wrong because the order is different as a result of fetching the HTML with mechanize (as opposed to copying and pasting from Firefox). The diff output just shows the whole chunk of button and time classes as different as a whole. Thus I really don't know what to do. |
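One way to narrow this down might be to list the slot buttons of each file in document order and compare the two lists directly, rather than reading the full diff. The sketch below uses the standard library's html.parser as a stand-in for BeautifulSoup (a deliberate swap so the example has no third-party dependency). The class and function names are hypothetical, and it assumes the slots are input elements whose value is "Partial" or "Available", as in the output shown above.

```python
from html.parser import HTMLParser


class SlotCollector(HTMLParser):
    """Collect (id, value, disabled) for slot buttons in document order."""

    def __init__(self):
        super().__init__()
        self.slots = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <input .../> tags are also routed through here.
        if tag != "input":
            return
        attrs = dict(attrs)
        if attrs.get("value") in ("Partial", "Available"):
            self.slots.append(
                (attrs.get("id"), attrs["value"], "disabled" in attrs)
            )


def collect_slots(html):
    parser = SlotCollector()
    parser.feed(html)
    parser.close()
    return parser.slots
```

Running collect_slots() on both the Firefox-saved file and the mechanize-fetched file and comparing the resulting lists would show whether the server really returned the slots in a different order or with different values.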
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21607
|
Posted: Wed Mar 26, 2014 1:56 am Post subject: |
|
|
That sounds to me like there is something wrong with your mechanized request. The site may be varying its output based on request headers, so you should check that Mechanize sends exactly the same request options as Firefox. You may even need to match the user-agent. |
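A sketch of what matching the headers could look like. mechanize's Browser keeps its default request headers in an addheaders list of (name, value) tuples, so replacing that list changes what is sent. The header values below are placeholders (assumptions); the real ones should be copied from what Firefox actually sends, as shown in its network tools. The helper name is hypothetical.

```python
# Placeholder values (assumptions): replace with the headers Firefox sends.
BROWSER_LIKE_HEADERS = [
    ("User-Agent",
     "Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0"),
    ("Accept",
     "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.5"),
]


def apply_headers(browser, headers=BROWSER_LIKE_HEADERS):
    # Works on any object with an `addheaders` attribute, such as a
    # mechanize.Browser; the new list replaces the default headers.
    browser.addheaders = list(headers)
    return browser
```

With mechanize installed, this would be called as apply_headers(mechanize.Browser()) before opening the login page. If the site also relies on cookies or JavaScript-generated state, matching headers alone may not be enough.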
|