curl-library

truncated files

From: Giuseppe Attardi <attardi_at_di.unipi.it>
Date: Sun, 07 Apr 2013 12:22:37 +0200

I am experiencing a problem that is puzzling me.

I am running a crawl, using libevent 2 to handle multiple simultaneous
connections.
The settings for the easy handle are:

   curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, 30L);
   curl_easy_setopt(curl, CURLOPT_TIMEOUT, 300L);
   curl_easy_setopt(curl, CURLOPT_LOW_SPEED_LIMIT, 500L);
   curl_easy_setopt(curl, CURLOPT_LOW_SPEED_TIME, 20L);
   curl_easy_setopt(curl, CURLOPT_HTTP_CONTENT_DECODING, 0L);
   curl_easy_setopt(curl, CURLOPT_HTTP_TRANSFER_DECODING, 1L);

In addition, I request a compressed transfer:
   curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "gzip");
I configured curl --without-zlib, because I want to save the files compressed.
The transfer callback (CURLOPT_WRITEFUNCTION) just copies the received
bytes to a local file.
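
A minimal sketch of such a write callback, for reference. The name
writeToFile is hypothetical, and it assumes the pointer registered with
CURLOPT_WRITEDATA is a FILE* opened in binary mode; it is not the exact
code used in the crawler.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical write callback matching the CURLOPT_WRITEFUNCTION prototype.
// Assumption: userdata is a FILE* opened in binary mode ("wb").
std::size_t writeToFile(char* ptr, std::size_t size, std::size_t nmemb,
                        void* userdata)
{
  std::FILE* out = static_cast<std::FILE*>(userdata);
  // With an item size of 1, fwrite returns the number of bytes written.
  std::size_t written = std::fwrite(ptr, 1, size * nmemb, out);
  // Returning a value different from size * nmemb tells libcurl the write
  // failed, and the transfer then ends with an error instead of silently
  // losing data.
  return written;
}
```

Returning the byte count from fwrite means a short write on the local
disk would show up as a failed transfer rather than as a truncated file
with no error.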

Everything works fine, except that occasionally some files are received
truncated.
curl does not report any error on those files.
The problem only occurs when a large number (~1000) of files are
requested, and it happens occasionally, for different files.
It is quite hard to reproduce and hence to track down.

The event scheme is similar to hiperfifo2.cpp
(http://curl.haxx.se/mail/lib-2012-12/att-0243/hiperfifo2.cpp)
except that the Retriever runs in its own thread, fed by another thread
that extracts links from pages.
The Retriever runs

     event_base_dispatch(base);

The timer associated with the multi_handle is this:

static void multi_timer_cb(CURLM* multi, long timeout_ms, Retriever* retr)
{
   if (timeout_ms == -1)
     return; // no timeout
   struct timeval timeout;
   timeout.tv_sec = timeout_ms / 1000;
   timeout.tv_usec = (timeout_ms % 1000) * 1000;
   evtimer_add(&retr->timer_event, &timeout);
}

which triggers the following callback:

static void timer_cb(int fd, short kind, void* arg)
{
   Retriever* retr = (Retriever*)arg;
   retr->action(0, CURL_SOCKET_TIMEOUT);
}

which in turn does this:

void Retriever::action(int action, int fd)
{
   int running;
   CURLMcode code = curl_multi_socket_action(multi_handle, fd, action,
                                             &running);
   int msgs;
   do {
     CURLMsg* msg = curl_multi_info_read(multi_handle, &msgs);
     if (msg && msg->msg == CURLMSG_DONE)
       transferCompleted(msg->easy_handle, msg->data.result);
   } while (msgs);
}

which is also invoked when there is activity on a socket:

static void socket_cb(CURL* e, curl_socket_t s, int what,
                      Retriever* retr, void* sockp)
{
   event* ev;
   curl_easy_getinfo(e, CURLINFO_PRIVATE, &ev);
   if (what == CURL_POLL_REMOVE)
     event_del(ev);
   else
     retr->activate(ev, s, what);
}
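
One detail of the timer callback shown earlier may be worth checking: the
libcurl documentation for CURLMOPT_TIMERFUNCTION says that a timeout_ms of
-1 means the application should delete its timer, whereas the code above
simply returns. The sketch below isolates that decision in a small helper;
the names TimerAction, TimerRequest and timerRequestFor are hypothetical,
and this is not claimed to be the cause of the truncation.

```cpp
#include <sys/time.h>

// Hypothetical helper: decide what the timer callback should do for a
// given timeout_ms value passed in by libcurl.
enum class TimerAction { Cancel, Arm };

struct TimerRequest {
  TimerAction action;
  timeval delay;  // meaningful only when action == TimerAction::Arm
};

inline TimerRequest timerRequestFor(long timeout_ms)
{
  TimerRequest req{};
  if (timeout_ms < 0) {
    // -1 from libcurl: any pending timer should be removed.
    req.action = TimerAction::Cancel;
    return req;
  }
  // 0 or more: (re)arm the timer; 0 means "fire as soon as possible".
  req.action = TimerAction::Arm;
  req.delay.tv_sec = timeout_ms / 1000;
  req.delay.tv_usec = (timeout_ms % 1000) * 1000;
  return req;
}
```

In multi_timer_cb, the Cancel case would map to
evtimer_del(&retr->timer_event) and the Arm case to
evtimer_add(&retr->timer_event, &req.delay).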

For the files that get truncated, curl_multi_info_read() is called and
returns CURLMSG_DONE.
So my guess is that some earlier event was missed.

Here is the trace of callbacks for one of those files:

GET http://marinomarina.edublogs.org/feed/ -> zero/Cache/0000/00/e4
SOCKETFUNCTION 274 CURL_POLL_IN
WRITEFUNCTION 2437 bytes
SOCKETFUNCTION 274 CURL_POLL_REMOVE

As we see, the write callback is called only once. The compressed file
is 5533 bytes long.
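
Since curl reports no error, one way to at least flag such files is to
compare the bytes received with the announced length when the transfer
completes. The sketch below is a hypothetical helper (looksTruncated is
not part of libcurl); it assumes the caller obtains the two values with
curl_easy_getinfo() using CURLINFO_SIZE_DOWNLOAD and
CURLINFO_CONTENT_LENGTH_DOWNLOAD, and that the server sent a
Content-Length header, which the trace above does not show.

```cpp
// Hypothetical helper, not a libcurl function.
// received: bytes delivered to the write callback, e.g. the value of
//           CURLINFO_SIZE_DOWNLOAD.
// expected: value of CURLINFO_CONTENT_LENGTH_DOWNLOAD, which libcurl
//           reports as -1 when the length is unknown.
inline bool looksTruncated(double received, double expected)
{
  if (expected < 0)
    return false;  // length unknown (e.g. chunked reply): cannot judge
  return received < expected;
}
```

With the numbers from the trace, looksTruncated(2437, 5533) is true.
Because CURLOPT_HTTP_CONTENT_DECODING is 0, the bytes handed to the write
callback are the still-compressed body, so they are comparable to the
Content-Length value.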

Could you suggest what might cause an event to be missed?

Any advice would be appreciated.

-- Beppe Attardi

Received on 2013-04-07