curl-library

RE: "pull" aspect of multi interface not quite working properly

From: Allen Pulsifer <pulsifer3_at_comcast.net>
Date: Thu, 21 Jun 2007 18:12:30 -0400

> > I had assumed that curl_multi_perform would call the
> > CURLOPT_WRITEFUNCTION function at most once every time it is called.
> > This ensures that my application never has to deal with more than
> > CURL_MAX_WRITE_SIZE bytes until it is ready for more.
>
> But why would it be limited to that? What says it can't
> receive more than
> CURL_MAX_WRITE_SIZE from the peer and if it has received the
> data already,
> why shouldn't it deliver it to the app?

Why? Because the app is not ready for it. Resources on the computer are
limited, buffer space among them. It will often be more efficient
to process data on-the-fly than to be forced to write it to disk then read
it back when you are ready to process it. This is what happens via paging
if you malloc too much memory. Consistent with the TCP/IP method of flow
control, we only want the origin server, and by extension the OS and the
library to send us data when we are ready for it. This is how the "read"
system call works. You only call the read function when your application
wants data. Read will never send you more than you ask for, even if it has
more data in the socket buffer. It won't deliver it just because it has it,
it will only deliver it when you ask for it. This is what a "pull"
interface means, and this is what I'm looking for. If libcurl on the other
hand sends you data even when you don't want it, or sends you more data than
you ask for, then it is no longer a pull interface, it is a push interface,
and the statement that the multi interface is a "pull" interface is not
correct (see http://curl.haxx.se/libcurl/c/libcurl-multi.html).
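
The read() behaviour described above can be checked with a small POSIX
sketch. The function name bounded_read_demo and the 256-byte limit are
hypothetical and only for illustration. It queues `available` bytes in a
pipe, then issues one read() asking for `want` bytes and reports how many
bytes that single call delivered.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Queue `available` bytes in a pipe, then do ONE read() asking for `want`
 * bytes. Returns the number of bytes that read() delivered, or -1 on error.
 * (Hypothetical helper; sizes are capped at 256 to keep the sketch small.) */
ssize_t bounded_read_demo(size_t available, size_t want)
{
    char src[256];
    char dst[256];
    int fds[2];
    ssize_t got;

    /* available must be non-zero, otherwise read() would block forever */
    if (available == 0 || available > sizeof(src) || want > sizeof(dst))
        return -1;
    if (pipe(fds) != 0)
        return -1;

    memset(src, 'x', available);
    if (write(fds[1], src, available) != (ssize_t)available) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* The kernel may hold more than `want` bytes, but it hands over
     * at most `want`; the rest stays queued until the next read(). */
    got = read(fds[0], dst, want);

    close(fds[0]);
    close(fds[1]);
    return got;
}
```

With 100 bytes queued and 16 requested, the single read() hands over 16
bytes; with only 8 queued and 16 requested, it hands over 8. The caller
sets the upper bound in both cases.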

> > The problem is that I have seen curl_multi_perform call
> > CURLOPT_WRITEFUNCTION more than once per invocation, delivering in total
> > more than CURL_MAX_WRITE_SIZE bytes. So far in testing, it has sometimes
> > called it twice, but there seems to be no guarantee. If it can call
> > CURLOPT_WRITEFUNCTION more than once, it could conceivably call
> > CURLOPT_WRITEFUNCTION an indeterminate number of times, essentially
> > breaking the "pull" aspect of the interface.
>
> I don't see how it breaks the pull aspect as it does this
> only when you ask
> for it and it stops when there's nothing left to pull. I
> understand it breaks
> the way you think of the concept, but I don't think I
> completely agree with
> that.

It breaks the "pull" aspect because the application cannot control the flow
of data. The application can say "send me data now", and in reply, libcurl
might send 200 GB of data. If you can't control the amount of data you
get, then it is not a "pull" interface. It's a "push-when-I-say-go"
interface, because the library is in fact pushing as much data as it wants,
but only when the application says "go".
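
One way to observe this is to count deliveries per curl_multi_perform
call. The sketch below is only an instrumentation idea, not anything
libcurl provides: struct delivery_stats, counting_write_cb and
stats_end_of_perform are hypothetical names. The callback uses the
standard CURLOPT_WRITEFUNCTION signature, so it can be registered as is.

```c
#include <stddef.h>

/* Hypothetical bookkeeping for one easy handle. */
struct delivery_stats {
    size_t calls;      /* callback invocations since the last reset */
    size_t bytes;      /* bytes delivered since the last reset */
    size_t max_calls;  /* most invocations seen in one perform call */
    size_t max_bytes;  /* most bytes seen in one perform call */
};

/* Write callback with the CURLOPT_WRITEFUNCTION signature. It discards
 * the data and only tallies how much was delivered. */
size_t counting_write_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct delivery_stats *st = userdata;
    size_t n = size * nmemb;

    (void)ptr;        /* payload is ignored in this sketch */
    st->calls++;
    st->bytes += n;
    return n;         /* report everything consumed so the transfer goes on */
}

/* Call after each curl_multi_perform() returns: record the peaks and
 * reset the per-call counters. */
void stats_end_of_perform(struct delivery_stats *st)
{
    if (st->calls > st->max_calls)
        st->max_calls = st->calls;
    if (st->bytes > st->max_bytes)
        st->max_bytes = st->bytes;
    st->calls = 0;
    st->bytes = 0;
}
```

Assuming the usual setup, the callback would be registered with
curl_easy_setopt(easy, CURLOPT_WRITEFUNCTION, counting_write_cb) and the
stats struct passed via CURLOPT_WRITEDATA. A max_calls above 1 after a
test run confirms the multiple-callback behaviour discussed here.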

> > So in summary, the problem is that curl_multi_perform sometimes calls
> > CURLOPT_WRITEFUNCTION more than once each time it is called, which
> > delivers in total more than CURL_MAX_WRITE_SIZE and overwhelms my
> > application with data. The solution would be to ensure that
> > curl_multi_perform can call CURLOPT_WRITEFUNCTION at most one time
> > before returning.
>
> In my view it is more about being able to better control how
> much data libcurl
> should be allowed to deliver to the application.

Yes, that is part of what makes a pull interface work, and it is consistent
with how TCP/IP and the read system call work.

> You consider a callback with the maximum amount of data to be
> what you can
> deal with, but I figure there might be other users who would
> rather prefer to
> control it to deliver even smaller pieces at a time if they could...

That can be set via CURLOPT_BUFFERSIZE. I realize that in non-committal
fashion, the documentation says "This is just treated as a request, not an
order. You cannot be guaranteed to actually get the given size". But in
practice, it appears that the maximum size per call to CURLOPT_WRITEFUNCTION
will never be larger than CURLOPT_BUFFERSIZE (except possibly in some
situations that don't apply to my application; I didn't read the code that
closely where it was not relevant to the problem I'm working on). If it is
really important for your application, you can also change
CURL_MAX_WRITE_SIZE and recompile.
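
To make the distinction concrete, here is a toy model (not libcurl's
actual code; deliver_in_chunks and track_largest_cb are hypothetical
names) of a library that caps the size of each callback but not the
number of callbacks. A per-call cap such as CURLOPT_BUFFERSIZE bounds the
largest single delivery, yet the total handed over in one go still
depends on how much data happens to be available.

```c
#include <stddef.h>

/* Same shape as a CURLOPT_WRITEFUNCTION callback. */
typedef size_t (*write_cb)(char *ptr, size_t size, size_t nmemb,
                           void *userdata);

/* Toy model: hand `len` bytes to `cb` in pieces of at most `max_chunk`
 * bytes. Returns the number of callback invocations. Stops early if the
 * callback reports that it consumed less than it was given. */
size_t deliver_in_chunks(char *data, size_t len, size_t max_chunk,
                         write_cb cb, void *userdata)
{
    size_t calls = 0;
    size_t off = 0;

    if (max_chunk == 0)
        return 0;

    while (off < len) {
        size_t n = (len - off < max_chunk) ? len - off : max_chunk;

        calls++;
        if (cb(data + off, 1, n, userdata) != n)
            break;
        off += n;
    }
    return calls;
}

/* Sample callback: remembers the largest single delivery it has seen. */
size_t track_largest_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    size_t *largest = userdata;
    size_t n = size * nmemb;

    (void)ptr;        /* payload is ignored in this sketch */
    if (n > *largest)
        *largest = n;
    return n;
}
```

With 10000 bytes available and a 4096-byte cap, the model makes three
callbacks and no single one exceeds 4096 bytes, but the application still
receives all 10000 bytes before control returns to it.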

In summary, flow control at some level is critical for every application.
Applications with a "give-me-as-much-data-as-fast-as-you-can" philosophy
only work while the data rate is slow and the processing rate (for example,
save to disk) is fast. But as soon as you put in a 1 GB/s network link and
put in some significant processing or start forwarding the data to slower
machines, flow control is critical. If the multi interface is truly going
to provide a "pull" interface, it has to allow the application to determine
when and how much data is sent.
Received on 2007-06-22