curl-and-python

Re: Program dies on call of multi.select...

From: Sam's Lists <samslists_at_gmail.com>
Date: Tue, 12 Aug 2014 15:18:14 -0700

Whoops, sorry, I somehow hit send too quickly on that last email. Here's
what I meant to send:

    multi = pycurl.CurlMulti()
    now = datetime.datetime.utcnow()
    for counter, website in enumerate(websites, 1):

        website.grabber = WebSite.Resource(website.next_page.original_url)
        multi.add_handle(website.grabber._curl)

    while 1:
        ret, num_handles = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

    while num_handles:
        multi.select(30.0)

The multi.select line above is where it dies about 70% of the time.

Why might it die there? Again there are no exceptions printed, no stack
traces, nothing.

Could it have something to do with a signal from the parent process?
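
A minimal sketch for checking the signal theory, assuming the crawler can
run a few extra lines before the multi loop starts. It uses only the
standard signal module; the names WATCHED, received, log_signal and
install_signal_logging are made up for illustration and are not part of
pycurl. A crash inside C code would not show up here, and SIGKILL cannot
be caught at all.

```python
from __future__ import print_function

import os
import signal
import sys

# Signals a parent process or terminal might plausibly send (an assumption;
# adjust the list as needed). SIGINT is left out so Ctrl-C still works.
WATCHED = ("SIGTERM", "SIGHUP", "SIGPIPE", "SIGUSR1")

received = []  # signal numbers seen so far, for later inspection


def log_signal(signum, frame):
    """Record the signal and report it on stderr instead of dying silently."""
    received.append(signum)
    print("got signal %d in pid %d" % (signum, os.getpid()), file=sys.stderr)


def install_signal_logging(names=WATCHED):
    """Install log_signal for each named signal that exists on this platform."""
    for name in names:
        signum = getattr(signal, name, None)
        if signum is not None:
            signal.signal(signum, log_signal)


# Call once in the main thread, before multi.perform() / multi.select().
install_signal_logging()
```

While these handlers are installed the process only logs the listed
signals and keeps running, so this is meant for debugging rather than
production use.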

This is using Python 2.7 and Ubuntu 12.04.

Pycurl is 7.19.5 and libcurl is 7.22.0-3ubuntu4.8

Thanks!

On Tue, Aug 12, 2014 at 3:09 PM, Sam's Lists <samslists_at_gmail.com> wrote:

> I have a rather complicated crawler that seems to die often - but not
> always at the same place.
>
> What's exasperating is that there are no exceptions, stack traces, etc.,
> printed. I was only able to find where it died by adding lots of print
> statements, and seeing what was the last thing to be printed.
>
> Here's a somewhat simplified version of the code:
>
> multi = pycurl.CurlMulti()
> print("ag2")
> now = datetime.datetime.utcnow()
> print("ag3")
> for counter, website in enumerate(websites, 1):
>     print("ag4")
>     assert website.crawl_type in ('standard', 'refresh', 'new')
>     print("ag5")
>     website.grabber = WebSite.Resource(website.next_page.original_url,
>                                        anonymous=Options.anonymous)
>     print("ag6")
>     website.next_page.crawled_ts = now
>     print("ag7")
>     multi.add_handle(website.grabber._curl)
>     print("ag8")
>
> print("ag9")
> # Number of seconds to wait for a timeout to happen
> if Options.test:
>     SELECT_TIMEOUT = 30.0  # Set for longer cause blicker_pierce takes forever
>                            # on the additional start page with all the wines
> else:
>     SELECT_TIMEOUT = 10.0
> print("ag10")
>
> # To do: implement it this way:
> # http://www.josefassad.com/pycurl_curlmulti_mini_howto
> # Stir the state machine into action
> while 1:
>     print("ag11")
>     ret, num_handles = multi.perform()
>     if ret != pycurl.E_CALL_MULTI_PERFORM:
>         break
>
> print("ag12")
> #CauseError
> # Keep going until all the connections have terminated
> while num_handles:
>     # The select method uses fdset internally to determine which file
>     # descriptors to check.
>
>     # Todo: This code is looped a lot
>     # Should there be a sleep here???? I got no idea
>
>     print("ag12.5")
>     print("calling multi.select with:", SELECT_TIMEOUT)
>     print("Please don't die here!!!!")
>     multi.select(SELECT_TIMEOUT)
>

_______________________________________________
http://cool.haxx.se/cgi-bin/mailman/listinfo/curl-and-python
Received on 2014-08-13