Re: Avoiding creation of multiple connections to an http capable server before CURLMOPT_MAX_PIPELINE_LENGTH is reached.

From: Carlo Wood <carlo_at_alinoe.com>
Date: Fri, 31 Oct 2014 14:57:25 +0100

On Fri, 31 Oct 2014 12:01:09 +0100 (CET)
Daniel Stenberg <daniel_at_haxx.se> wrote:

> Ah yes. This setting was introduced when someone wanted a more fixed
> limit independent of the pipelining working or not against servers.
> Limiting connections per host is also a way to follow the HTTP spec
> and to not have services to a single host "starve out" communication
> with other hosts by the sheer amount of traffic.
>
> Already in the past I've pondered if it would be better to somehow
> have a callback or something to send a lot of info to, and to have
> that decide the action in several of these cases, as just adding more
> and more options also makes things very complicated and hard to use.
> And tricky to understand how things work together.
>
> CURLMOPT_MAX_HOST_PIPELINED_CONNECTIONS feels a bit... long. =) Also
> a bit hard to explain exactly what it limits and when each option is
> in use.

Well, it all comes down to whether you want to support using the same
multi handle for different servers or not. My application, written when
I didn't have to support pipelining at all, would need to be almost
completely rewritten if I had to use a multi handle per server :/.

In theory, one would want to be able to set a limit per service
(host:port), as that gives the most flexibility. In practice you could
get away with setting a global limit - that is, until http pipelining
was added to libcurl. The demands on the number of connections per
server for a pipelining capable server and for one that doesn't support
pipelining differ just too much to be covered by a single limit.

If I had a CURLMOPT_MAX_HOST_PIPELINED_CONNECTIONS then I'd be partly
happy, although it wouldn't solve the real problem described here: when
libcurl at 'random' times forgets that a server can do pipelining, it
also wouldn't know whether to use CURLMOPT_MAX_HOST_CONNECTIONS or
CURLMOPT_MAX_HOST_PIPELINED_CONNECTIONS...

The only way this would be useful, therefore, is if the user were able
to set a limit on the number of connections for a *specific*
host:port, where we assume that the application knows which host:port
supports pipelining :/.
 
[...]
> It already did have a pretty good control of the number of
> connections as it could set a max (both total and per host). You're
> not increasing the control here, you're changing the ways to control
> it.

See above. The user doesn't have this control unless
1) they know which URLs support pipelining, and
2) they use a dedicated multi handle for those URLs (and another multi
handle for the other URLs).

Only then can they set different limits on the maximum number of
connections for pipelining and non-pipelining traffic. I hope you agree
that it is reasonable to demand that libcurl supports having different
limits for this: pipelining usually requires fewer connections, and the
requirements are just too different to be covered by a single setting.
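
For illustration, this is roughly what that dedicated-multi-handle
workaround looks like with today's API (a minimal sketch; the limit
values 1, 8 and 32 are just examples, and the application has to know
the URL split by itself):

    #include <curl/curl.h>

    /* One multi handle per connection policy; the application must
       already know which hosts support pipelining. */
    static void setup_multi_handles(CURLM **pipelined, CURLM **plain)
    {
      /* Hosts known to pipeline: one connection, a long pipe. */
      *pipelined = curl_multi_init();
      curl_multi_setopt(*pipelined, CURLMOPT_PIPELINING, 1L);
      curl_multi_setopt(*pipelined, CURLMOPT_MAX_PIPELINE_LENGTH, 32L);
      curl_multi_setopt(*pipelined, CURLMOPT_MAX_HOST_CONNECTIONS, 1L);

      /* Everything else: pipelining off, parallel connections. */
      *plain = curl_multi_init();
      curl_multi_setopt(*plain, CURLMOPT_MAX_HOST_CONNECTIONS, 8L);
    }

Easy handles for pipelining-capable servers are then added to the first
multi handle and everything else to the second.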

However, if curl chooses to demand this approach (using a dedicated
multi handle) then that implies that the user knows which URLs support
pipelining. And once we assume that, imho it would be a small step to
support the same functionality without requiring a separate multi
handle: just allow the user to specify what they want on a per-host
basis. Internally there is no need for a separate multi handle at all.
The only reason it seems necessary is that currently the limit on the
maximum number of connections per host can only be set on a per multi
handle basis.

However, libcurl currently doesn't have the means to set options on a
per site (host:port) basis, hence my suggestion to add it on a per easy
handle basis :/ as that is better supported in the current API.
 
> > If -say- it wants to use the benefit of pipelining, which means
> > only using a single socket (connection), by limiting the number of
> > connections to only ONE connection, then all it has to take care of
> > is to never add more than CURLMOPT_MAX_PIPELINE_LENGTH easy handles
> > to the multi handle at the same time.
>
> That would be very inconvenient for most applications though as they
> normally don't even care or know which URLs may end up on the same
> pipeline (or not). Also, that would force them to drain the pipeline
> first before they can add more transfers, which is inefficient.

If that is typical then I see no other way than to specify the
(different) limits for pipeline capable and non-pipeline capable
servers anyway, hence introduce a
CURLMOPT_MAX_HOST_PIPELINED_CONNECTIONS. But then what to do as long as
curl doesn't yet know whether a connection supports pipelining? That
would result in having to limit the number of connections to the
minimum of the two limits, no?

This is not that bad! If a user doesn't want this, then they have the
following options:

- Set CURLMOPT_MAX_HOST_PIPELINED_CONNECTIONS and
  CURLMOPT_MAX_HOST_CONNECTIONS to the same value, which gives the
  current behavior (see the sketch after this list).
- Use a dedicated multi handle for connections that might support
  pipelining (that probably support it), and do not set
  CURLMOPT_PIPELINING on the other multi handle(s), causing those to
  use CURLMOPT_MAX_HOST_CONNECTIONS as the limit regardless.
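
To make the first bullet concrete (CURLMOPT_MAX_HOST_PIPELINED_CONNECTIONS
is of course hypothetical at this point, and 'multi' is an existing
multi handle):

    /* Hypothetical: give the proposed option the same value as the
       existing per-host limit, so behavior stays exactly as today. */
    curl_multi_setopt(multi, CURLMOPT_MAX_HOST_CONNECTIONS, 6L);
    curl_multi_setopt(multi, CURLMOPT_MAX_HOST_PIPELINED_CONNECTIONS, 6L);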

Compared to the current situation, this only improves things.
The ONLY alternative algorithm is to use the maximum of the two limits
as long as it is not known whether pipelining is supported, and then,
as soon as it is known and the smaller limit applies, to disconnect the
excess connections. If that seems like a use case that needs to be
supported, then that would be possible too, by adding a new CURLMOPT
that allows the user to specify this strategy.

[...]
> The application can control the maximum number of connections to each
> host and the maximum number of connections used in total. That's on
> purpose and by design. If they don't work then we have a bug.

But more flexibility is needed - see the arguments above.

[...]
> This is only retried if that reset happens immediately after a
> request has been issued, since that's then usually the result of a
> persistent re-used connection having been closed by the other end. A
> connection that just gets reset somewhere in the middle will not be
> "retried" in any way but is a mere transfer failure.

Ok, that is a separate case then (ie, case 4).

> > In all cases, since there is only a single connection to that
> > server (as desired)
>
> As desired in your case. Pipelining users I've talked to before used
> pipelining to maximize HTTP transfer performance, and that usually
> means more than one connection.

Just to check we're on the same page. Usually when I work on a library
(or protocol design) the rule of thumb is that if there is even ONE
reasonable use case thinkable then it has to be supported (if possible).
Sometimes I even go further and say that if something can be supported
as a thought experiment then it should be supported, even if you can't
think of a good reason why someone would want it: there are so many
people out there that the developer/designer cannot possibly dictate to
all of them what is "reasonable". The user will have to decide that for
themselves. The library just needs to give them the option.

You keep saying "this applies to you, but...". Imho, that means it
should be supported, no? Besides, my use case might be the only one
that you have heard of, but it is still a very reasonable one.

If despite that you are not willing to support having different limits
for the number of connections to a specific host (unless the user works
around that limitation by using a dedicated multi handle), then please
tell me now, so I can start working on adding support for multiple
multi handles to my application right away :/

> > closing such a connection currently causes the bundle for that site
> > to be deleted, and hence for libcurl to forget that that site is
> > http pipeline capable. After this, since in all cases the "pipe
> > broke", all not-timed-out requests in the pipe are being internally
> > retried, all instantly-- before the server replies to the first--
> > hence each resulting in a new connection and the length of the pipe
> > (CURLMOPT_MAX_PIPELINE_LENGTH) is converted into
> > CURLMOPT_MAX_PIPELINE_LENGTH parallel connections!
>
> Oh right. With the minor note that CURLMOPT_MAX_HOST_CONNECTIONS
> could still limit the damage somewhat.

True - for the non-pipelining servers I can set that limit to 8.
Hence, even if CURLMOPT_MAX_PIPELINE_LENGTH is set to 32 I will still
not get more than 8 connections. Nevertheless, it seems that even 8
connections is not appreciated by the server (don't ask me why - I have
no control over that at all; I know nothing of that server except what
I can determine empirically).

> Having the knowledge of a host's pipelining capability being dumped
> at the same time we kick out the connection is a pretty severe blow.
> It should really be kept in a separate cache with a much longer
> life-time so that repeated connections to the same hosts would have
> that knowledge immediately.

This has been my approach so far (to try and make libcurl never forget
it). However, it seems there are so many (unexpected) cases where the
connection gets closed that it remains a fuzzy approach: it does not
give a hard guarantee that the number of connections will stay low
(provided the application limits the number of added easy handles
accordingly, which I'm sure is not the most typical use of libcurl, as
it indeed requires a pretty sophisticated application). It would
therefore be much more desirable to be able to SET that limit (on a per
host:port basis), so that it is as hard a limit as the current "global"
CURLMOPT_MAX_HOST_CONNECTIONS.
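
For example (pure fantasy syntax; nothing like curl_multi_setopt_site
exists in the API today):

    /* Hypothetical per-site override of the connection limit, as hard
       as CURLMOPT_MAX_HOST_CONNECTIONS is now, but per host:port. */
    curl_multi_setopt_site(multi, "server.example.com:80",
                           CURLMOPT_MAX_HOST_CONNECTIONS, 1L);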
 
[...]
> > 1) Libcurl creates a new connection for every new request (this
> > is the current behavior).
>
> At least until it hits a max limit.
>
> > 2) Libcurl creates a single connection, and queues all other
> > requests until it received all headers for that first request and
> > decided if the server supports pipelining or not. If it does, it
> > uses the same connection for the queued requests. If it does
> > not, then it creates new connections for each request as under 1).
>
> So: If there's a transfer going on for the same host but we don't
> know pipelining capability for the connection yet, queue up the
> transfer until we get the answer?

Yes.

> > 3) Libcurl assumes every connection can do pipelining and just
> > sends all requests over the first connection, until it finds out
> > that this server can't do pipelining - in which case the pipe
> > 'breaks' and already existing code will cause all requests to be
> > retried - now knowing that the server does not support pipelining.
>
> That could mean a rather hefty performance blow in certain
> situations. So again it boils down to what your prios are: fewer
> connections or faster (lower latency) transfers.

Agreed, hence I think it should be left up to the user to select this,
or one of the other strategies. The main point to agree upon here is
that there are no alternatives other than the ones I listed. Then we
can make a decision on what to do afterwards :). As argued above
though, I suppose libcurl should just support all algorithms (and let
the user choose). I think that means option 4.

> > 4) Libcurl doesn't wait for headers but uses information from the
> > user to decide if the server is supposed to support http
> > pipelining; meaning that it picks strategy 2) or 3) based on a flag
> > set on the easy handle by the user.
>
> But what are the odds of the application knowing that on a per URL
> basis? Or a per easy-handle basis? It feels like more of a policy
> that you want to set globally: assume pipelining to work or to not
> work.

You can't know what situation libcurl will be used in, now or in the
future. I can only tell you that in my case I know *exactly* which
servers support pipelining - even before making any connection. I'm
just relying a bit on runtime detection to be extra flexible...

If you're thinking of a browser then sure, you don't know that. But
when you think about specific applications that connect to specific
servers, then it seems pretty normal to me that the application knows
up front when it can expect pipelining.
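
To make option 4 concrete: it could be as simple as a boolean easy
option that such an application sets from its own knowledge (the name
CURLOPT_ASSUME_PIPELINING is made up here, and 'multi' is an existing
multi handle):

    CURL *easy = curl_easy_init();
    curl_easy_setopt(easy, CURLOPT_URL, "http://server.example.com/data");
    /* Hypothetical flag: treat this server as pipeline capable before
       any response has been seen (strategy 3); leave it unset to get
       the conservative queueing of strategy 2. */
    curl_easy_setopt(easy, CURLOPT_ASSUME_PIPELINING, 1L);
    curl_multi_add_handle(multi, easy);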

> > I think that option 4 is by far the best experience for the user;
> > however it is the hardest to implement because it requires
> > implementing both 2) and 3) as well as adding the code for a new
> > flag.
[...]
> Option 2 cannot be made the single static behavior, no. That's
> completely out of the question.

Ok.

[...]
> Lovely, but I don't do this full-time and these are rather tricky
> questions so I'm not always very fast to respond, but I will
> certainly try my best to not be a road block!

Thanks!
For the record, I'm an open source coder like you. I do this in my free
time without any payment or commercial interest. I just happen to have
a lot of free time :p.

To summarize:

If you agree that I listed all the algorithms worth considering, then
it seems that 4) has to be the choice. Doing nothing means you pick 1).
Implementing 4) extends the functionality of libcurl and would still
allow things to work like they currently do (1). The idea is to let the
user choose.

One possible way to achieve this is by using the callback that you
mentioned. I really like that idea.

We could add a callback for when a new bundle is created; that would
completely take care of the problem of forgetting that a server is
capable of pipelining, as well as allow the user to set options on a
per host:port basis.

If, every time a connection is made to a service unknown to libcurl
(ie, a new bundle is created), the user application gets a callback in
which it can specify things like the maximum number of connections and
whether the server must be assumed to support pipelining, be
blacklisted, or remain undetermined, then ALL of this would be solved
wonderfully!
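
Something along these lines, perhaps (every name below is invented,
just to show the shape of the idea):

    /* Hypothetical callback, invoked whenever libcurl creates a bundle
       for a host:port it has no information about yet. */
    static int bundle_created(const char *host, int port,
                              struct curl_bundle_policy *policy,
                              void *userp)
    {
      policy->max_connections = 1;
      policy->pipelining = CURL_PIPE_ASSUME; /* or CURL_PIPE_NEVER,
                                                CURL_PIPE_DETECT */
      return 0;
    }

    curl_multi_setopt(multi, CURLMOPT_BUNDLE_FUNCTION, bundle_created);
    curl_multi_setopt(multi, CURLMOPT_BUNDLE_DATA, &app_state);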

-- 
Carlo Wood <carlo_at_alinoe.com>