curl-library

Re: [PATCH 2/7] pipelining: Fix connection handling under timeouts.

From: Carlo Wood <carlo_at_alinoe.com>
Date: Mon, 10 Nov 2014 19:18:47 +0100

On Mon, 10 Nov 2014 11:25:14 +0100 (CET)
Daniel Stenberg <daniel_at_haxx.se> wrote:

> (First: I removed Curl_multi_set_easy_connection just now so this
> patch needs a very minor fix. No biggie of course and I fixed my
> local copy.)

I merged your master into my repository now. Will generate a new diff
from that to test if this system works now ;)

> 1 - Isn't the multi_do_connection_id check basically another way of
> checking that the state hasn't reached DONE ? It seems weird to
> require a specific connection to assume there's pending data to read,
> when you should be able to use the state to check for that instead of
> this new field. Right?

I presume that *if* a request times out after already having reached
state DONE, then we shouldn't time it out, no? But even if we should,
I don't think that is the state change that interests us.

The problem is this. Picture a literal pipeline where GET messages
go into the pipe at one end (top) and replies come out at the other
(bottom):

send_pipe->head->prev: GET 6
send_pipe->head: GET 5
traveling to the server: GET 4
queued on the server: GET 3
processed by the server
  and traveling back to us: Reply 2
Received reply and
  reached state DONE: Request 1

If at this moment we find that request nr. 1 timed out (the timeout
expired before the reply was received), then we don't have to close
the connection, because its reply was already received. It also
reached state DONE, so yes, there is a correlation between the two.

If instead we find that request nr. 2 timed out, then we must close
the connection in order to disregard a possible reply (Reply 2) that
might or might not still arrive at some point. Not only that: the fact
that a request on this pipeline timed out tells us that the server is
bad, and I wouldn't like to reuse this connection for new requests
anyway :p. Again you are right: here the request didn't reach state
DONE and we want to close the connection.

The same reasoning applies to requests 3 and 4: we already sent the
GET messages to the server and can therefore not use the connection
anymore. A reply might still be received; in fact it MUST be received
before we could ever get a reply to newly added requests, but we are
disregarding handles that we would otherwise need to keep in the
recv_pipe to account for the proper order in which to process replies.
We could keep them despite the timeout, but that doesn't seem wise:
the timeout indicates that closing the connection is a proper action
anyway.

Note that when request 3 or 4 times out we would not have to close the
connection right away, because we might still receive Reply 2 (which
didn't time out). In fact, we COULD keep the connection open until we
either receive Reply 2 or until the current head of the recv_pipe
times out (a timeout of any handle in the receive pipe wipes that
handle plus all handles towards the tail; so until the head times out
there would still be something valid in the recv_pipe). I deliberately
did not want to support that, in order to keep things simple. In
practice all requests will have roughly the same timeout, so if
anything times out it will be the handles that were added first: the
recv_pipe head. I also think that a timeout generally indicates a
problem with the server/connection, so it is to be expected that the
rest will time out anyway: simply closing the connection, also
disregarding a possible Reply 2, is not a bad decision in the rare
case that request 3 times out first.

However, the same reasoning does not apply to requests 5 and 6: these
haven't sent anything to the server yet! If we kill them then we will
NOT receive a reply from the server; the server knows nothing about
them. As stated before, assuming equal timeout values on all requests,
it will be the head (the request with the lowest number, 1) that times
out first. Hence, in practical cases, when a request times out that is
at the send_pipe head (and thus hasn't sent anything to the server
yet), it will be at the head of the complete pipeline and there will
be nothing in the pipe at all. This is the case that I ran into: the
connection was JUST created, brand new, but it got requests added to
it that had already timed out (I get back to that below).

The point is, however, that these requests did not reach the DONE
state yet. The reason we want to close a connection is therefore not
whether a request reached the DONE state, but whether or not we
already sent the request off to the server over that connection.

But first, let me address the case where the timeout of request 5 is
smaller than those of requests 4, 3, 2 and 1, and it is the first to
time out. In that case you COULD argue that the connection is
suspicious: after all, normally you don't set tiny, short timeouts,
so ANY timeout makes it plausible that the connection as a whole is
bad and warrants the strategy of just dropping it, exactly as I
argued before :/. But in the case I ran into that is certainly NOT
true: a new, freshly created connection DOES get requests added that
have already timed out, which then time out after "0 ms"... obviously
that says nothing about the state of the connection ;). So, I made
the decision of whether or not to close a connection a function of:
was the request already sent to the server? Because if it was, then
we MUST close the connection (eventually, possibly waiting first for
requests sent earlier, but that would make the code more complex),
while if it wasn't then not only do we not need to close it, but in
the only practical case (certainly assuming equal timeouts) the
connection is fresh and we definitely don't want to close it (that
only causes the connect/close flood that I observed).

If you agree with that, then what we need to detect is whether a
request has already sent data to the server or not, not whether it
already reached the DONE state. The added multi_do_connection_id
serves this purpose: it holds the id of the connection that this
request already sent data over. If it is -1 then no data was sent;
but it can also be set to the id of some other connection than the
one we would be closing (i.e., after a retry, or a restart at CONNECT
because a request downstream timed out), and then we don't need to
close the current connection of course.

> 2 - For a transfer that you abort due to a timeout, that is part of a
> pipeline and there is data pending you set bits.timedout to prevent
> it from getting more transfers added to the pipeline. But shouldn't
> you get the same effect by setting bits.close and then avoid a new
> struct member for only this purpose?

No, I tried that first. But the result of doing that is that the (new)
connection isn't used until it actually connects. Namely, upon
construction of a new connectdata the close bit is set by default, to
close the connection after use, until the user explicitly requests
reuse. This (resetting the close bit) only happens after the
connection succeeds, I found out.

The resulting effect was this: all requests in the recv pipe, whose
head had timed out (invalidating everything), created their own new
connection.

Namely, (the connection is closed and) all handles are set to the
state CONNECT. The first one is processed and creates a new
connection, but that new connection has the close bit set. The next
handle is processed and refuses to use the first connection, because
it interprets the close bit as this connection having already sent
data... timed out... blah blah. Nonsense, thus, but all requests end
up firing a connect() of their own.

Using a new bit seemed a lot easier than (in fact) abusing an existing
bit with a double meaning (close after DONE / poisoned due to timeout),
where I'd have to add hacks to achieve that double-duty use, hacks
which might accidentally be broken again in the future.

> 3 - The changing of the state to DONE as Monty already commented on:
> I believe there's a miniscule risk for a race condition here if the
> timeout is triggered when already in the CURLM_STATE_DONE state, so I
> believe we need a check for that and if already in DONE go to
> CURLM_STATE_COMPLETED instead. Your thoughts?

I don't think that multi_runsingle() can be invoked recursively, can
it? Or more to the point, it cannot be invoked recursively from the
state CURLM_STATE_DONE.

Assuming that is correct (and I really don't see how it could not be),
then if a handle has data->mstate == CURLM_STATE_DONE and it reaches
the 'switch(data->mstate)' of the state machine, then 1) it didn't
time out yet (that was just checked), unless we got there from my code
and it is the intended behavior, and 2) it will reach state
CURLM_STATE_COMPLETED before ever entering multi_runsingle() again.

So, if we reach the code where I change the state to DONE after a
timeout and the state is already CURLM_STATE_DONE, then that state
wasn't handled yet; only multistate(data, CURLM_STATE_DONE) was
called, but nothing else.

If the state was set to CURLM_STATE_COMPLETED then we can't reach this
code, it is enclosed in a

    if(data->easy_conn &&
       (data->mstate >= CURLM_STATE_CONNECT) &&
       (data->mstate < CURLM_STATE_COMPLETED)) {
...
      if(timeout_ms < 0) {
...
        data->result = CURLE_OPERATION_TIMEDOUT;
        multistate(data, CURLM_STATE_DONE);
      }
    }

Am I wrong? I don't see what you mean or where a race condition could
exist, or I'm totally missing your point.

-- 
Carlo Wood <carlo_at_alinoe.com>
-------------------------------------------------------------------
List admin: http://cool.haxx.se/list/listinfo/curl-library
Etiquette:  http://curl.haxx.se/mail/etiquette.html