curl-and-python

Re: Program dies on call of multi.select...

From: Sam's Lists <samslists_at_gmail.com>
Date: Fri, 15 Aug 2014 03:30:53 -0700

Hmmm... Is there any more information I should provide to get a response?

Or would I be better off asking on the general libcurl mailing list?

Anyone have a recommended next step in solving this problem?

Thanks!

On Tue, Aug 12, 2014 at 3:18 PM, Sam's Lists <samslists_at_gmail.com> wrote:

> Whoops, sorry, I somehow hit send too quickly on that last email. Here's
> what I meant to send:
>
> multi = pycurl.CurlMulti()
> now = datetime.datetime.utcnow()
> for counter, website in enumerate(websites, 1):
>     website.grabber = WebSite.Resource(website.next_page.original_url)
>     multi.add_handle(website.grabber._curl)
>
> while 1:
>     ret, num_handles = multi.perform()
>     if ret != pycurl.E_CALL_MULTI_PERFORM:
>         break
>
> while num_handles:
>     multi.select(30.0)
>
>
> The multi.select line above is where it dies about 70% of the time. (A
> sketch of the standard perform/select loop follows this message.)
>
> Why might it die there? Again, there are no exceptions printed, no stack
> traces, nothing.
>
> Could it have something to do with a signal from the parent process? (See
> the signal-handling note after the quoted messages.)
>
> This is using Python 2.7 and Ubuntu 12.04.
>
> Pycurl is 7.19.5 and libcurl is 7.22.0-3ubuntu4.8.
>
> Thanks!
>
>
>
>
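
For comparison, the loop in the pycurl distribution's own retrieve-multi
example calls perform() again after every select() wake-up, which is what
lets num_handles eventually reach zero; in the snippet as quoted, nothing
inside the final while loop updates num_handles. A minimal sketch of that
canonical pattern (the URL is just a placeholder):

    import pycurl

    multi = pycurl.CurlMulti()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, "http://example.com/")  # placeholder
    multi.add_handle(curl)

    # Stir the state machine into action.
    while 1:
        ret, num_handles = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

    # Keep going until all the connections have terminated.
    while num_handles:
        # select() returns -1 when libcurl has no descriptor to
        # watch yet; retry rather than treating it as activity.
        if multi.select(30.0) == -1:
            continue
        # perform() must run again after each wake-up, otherwise
        # num_handles never changes and this loop cannot exit.
        while 1:
            ret, num_handles = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break

    multi.remove_handle(curl)
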
> On Tue, Aug 12, 2014 at 3:09 PM, Sam's Lists <samslists_at_gmail.com> wrote:
>
>> I have a rather complicated crawler that seems to die often, but not
>> always at the same place.
>>
>> What's exasperating is that there are no exceptions, stack traces, etc.,
>> printed. I was only able to find where it died by adding lots of print
>> statements and seeing what the last thing printed was. (See the
>> exit-status sketch after the quoted messages.)
>>
>> Here's a somewhat simplified version of the code:
>>
>> multi = pycurl.CurlMulti()
>> print("ag2")
>> now = datetime.datetime.utcnow()
>> print("ag3")
>> for counter, website in enumerate(websites, 1):
>>     print("ag4")
>>     assert website.crawl_type in ('standard', 'refresh', 'new')
>>     print("ag5")
>>     website.grabber = WebSite.Resource(website.next_page.original_url,
>>                                        anonymous=Options.anonymous)
>>     print("ag6")
>>     website.next_page.crawled_ts = now
>>     print("ag7")
>>     multi.add_handle(website.grabber._curl)
>>     print("ag8")
>>
>> print("ag9")
>> # Number of seconds to wait for a timeout to happen
>> if Options.test:
>>     # Set for longer cause blicker_pierce takes forever on the
>>     # additional start page with all the wines
>>     SELECT_TIMEOUT = 30.0
>> else:
>>     SELECT_TIMEOUT = 10.0
>> print("ag10")
>>
>> # To do: implement it this way:
>> # http://www.josefassad.com/pycurl_curlmulti_mini_howto
>> # Stir the state machine into action
>> while 1:
>>     print("ag11")
>>     ret, num_handles = multi.perform()
>>     if ret != pycurl.E_CALL_MULTI_PERFORM:
>>         break
>>
>> print("ag12")
>> #CauseError
>> # Keep going until all the connections have terminated
>> while num_handles:
>>     # The select method uses fdset internally to determine which
>>     # file descriptors to check.
>>
>>     # Todo: This code is looped a lot
>>     # Should there be a sleep here???? I got no idea
>>
>>     print("ag12.5")
>>     print("calling multi.select with:", SELECT_TIMEOUT)
>>     print("Please don't die here!!!!")
>>     multi.select(SELECT_TIMEOUT)
>>
>
>
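
On the signal question: there are two usual suspects with libcurl. When
libcurl is built without asynchronous DNS support, it uses SIGALRM for
resolver timeouts, which is known to crash multi-threaded programs; the
documented fix is setting NOSIGNAL on each easy handle. Separately, the
default action for SIGPIPE is to kill the process with no traceback.
CPython normally ignores SIGPIPE at startup, but a parent process that
resets signal dispositions before exec could hand the child the default
one, which would match dying silently during network I/O. A hedged
sketch of both precautions:

    import signal
    import pycurl

    # Make a broken connection surface as an error instead of a
    # fatal SIGPIPE (in case the parent process reset the handler).
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)

    curl = pycurl.Curl()
    # Keep libcurl from using SIGALRM for DNS timeouts; recommended
    # for any threaded use of pycurl.
    curl.setopt(pycurl.NOSIGNAL, 1)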

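As for finding out what actually killed it: when a Python process
vanishes with no traceback, it generally was not a Python exception at
all but a C-level crash or a fatal signal, and the wait status seen by
the parent records which. From a shell, an exit status of 128+N means
death by signal N. A minimal sketch, assuming the crawler is launched
from a Python parent; the "crawler.py" name is hypothetical:

    import subprocess

    child = subprocess.Popen(["python", "crawler.py"])
    child.wait()

    if child.returncode < 0:
        # On POSIX a negative returncode is death by signal, e.g.
        # -11 is SIGSEGV and -9 is SIGKILL (possibly the OOM killer).
        print("crawler killed by signal %d" % -child.returncode)
    else:
        print("crawler exited with status %d" % child.returncode)

The faulthandler package (a PyPI backport of the Python 3.3 module) can
additionally dump a Python-level traceback on SIGSEGV and friends.
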
_______________________________________________
http://cool.haxx.se/cgi-bin/mailman/listinfo/curl-and-python
Received on 2014-08-15