Menu

#572 crash when dns slow

closed-fixed
libcurl (356)
8
2013-06-21
2006-02-14
xbx
No

I have a strange setup with the dns taking a long time
to answer.
I run my app that make periodic http(s) requests.
Everything goes fine, but after a while (that is, after
maybe 10 successfull requests to the same host) I have
this crash:

#0 0xb7fade01 in Curl_connecthost (conn=0x8a73094,
remotehost=0x0,
sockconn=0x8a7312c, addr=0xbfaefdc4,
connected=0xbfaefe07 "")
at connect.c:789
#1 0xb7f9ed04 in ConnectPlease (conn=0x8a73094,
hostaddr=0x0,
connected=0xbfaefe07 "") at url.c:2017
#2 0xb7fa19f3 in SetupConnection (conn=0x8a73094,
hostaddr=0x0,
protocol_done=0xbfaefe88 "") at url.c:3670
#3 0xb7fa1b08 in Curl_connect (data=0x8ab48bc,
in_connect=0x8b02798,
asyncp=0xbfaefe89 "", protocol_done=0xbfaefe88 "")
at url.c:3729
#4 0xb7faf002 in curl_multi_perform
(multi_handle=0x8aeac1c,
running_handles=0xbfaeff34) at multi.c:422
[....]

(notice the hostaddr=0x0 part)

I'm using libcurl 7.15.1, with this configure options:
./configure --prefix=/usr/ --disable-gopher
--disable-ldap --disable-dict --disable-telnet
--enable-nonblocking --enable-debug
--disable-crypto-auth --enable-ares --disable-ipv6

I could further investigate, but I'd have to know which
way to look..
More precisely,
code = CreateConnection(data, in_connect, &dns, asyncp);
returns:
code == OK
dns == null
asyncp == false
Is the problem that 'code' shouldn't be 'ok', or that
the program shouldn't crash when dns == Null?

Discussion

1 2 > >> (Page 1 of 2)
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-15

    Logged In: YES
    user_id=1110

    Right, can you see if you can figure out why
    CreateConnection() returns CURLE_OK while still not
    returning any address and setting asnyncp to FALSE?

     
  • xbx

    xbx - 2006-02-15

    Logged In: YES
    user_id=1452265

    looks like it's because, url.c::3530
    if(conn->bits.reuse) {
    is true.

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-15

    Logged In: YES
    user_id=1110

    Ah, right. That's NULL if the connection is re-used. So it
    looks like it enters the SetupConnection() function and then
    some 30 lines into the function the condition

    if(CURL_SOCKET_BAD == conn->sock[FIRST])

    equals true. And then it wrongly tries to re-connect using
    the host address that is now NULL.

    Can you figure out how it comes it has marked the socket as
    being bad, while still believing the connection is being
    re-used fine?

    What does a CURLOPT_VERSBOSE output reveal? Does the
    connection just prior to the failing one look fine and claim
    to be leaving the connection alive? Does the failing one
    claim to retry the connection or just re-use it normally
    like other re-used connections?

     
  • xbx

    xbx - 2006-02-15

    Logged In: YES
    user_id=1452265

    The CURLOPT_VERSBOSE doesn't show anything strange, (expect
    that the reuse case isn't triggered as often as I thought..).

    however, in ConnectionExists() there is something strange:
    bool dead = SocketIsDead(check->sock[FIRSTSOCKET]);
    returns False, although check->sock[FIRSTSOCKET]==-1
    So I reckon that's why the (CURL_SOCKET_BAD ==
    conn->sock[FIRST]) triggers in SetupConnection().

    So, maybe SocketIsDead() should check for that case?

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-15

    Logged In: YES
    user_id=1110

    Hm. Yes a first test for you would be to modify the
    SocketIsDead() code to check for that situation.

    I'm a bit reluctant to simply add that check and be happy,
    since I consider it being a problem that the
    sock[FIRSTSOCKET] value ended up like this in the first place...

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-15
    • priority: 5 --> 8
     
  • xbx

    xbx - 2006-02-15

    Logged In: YES
    user_id=1452265

    I think I understood what is going on!

    I changed the SocketIsDead(), and it did remove the crash.
    However the app crashed a bit later, in an ares callback,
    because the connection data had been deleted and overwritten.

    So I guess that what is happening is that a connection is
    being created and waiting for a dns querry before being
    fully initialized.
    And before the dns querry is over, the connection is reused.
    Here the fact that it is "dead" is discovered and the
    connection is killed. And later when the callback from ares
    is called, the connection has been destructed...

    So I guess this is in fact 2 bugs:
    - when a connection is destructed, ares calls should be aborted.
    - the connection shouldn't be reused as it is being
    initialised rather than finished...

    but
    1) I may be wrong
    2) I have no idea how to correct any of these...

     
  • xbx

    xbx - 2006-02-15

    Logged In: YES
    user_id=1452265

    I tried something for:
    > - the connection shouldn't be reused
    > as it is being initialised rather than finished...

    in ConnectionExists(), I added a check so that connections
    with (check->sock[FIRSTSOCKET] == CURL_SOCKET_BAD) wouldn't
    be reused.
    And the program doesn't seem to crash any longer. (can't be
    100% sure, because it is a random bug)

    But I don't know if this condition is supposed to mean
    "being initialised connection" ....

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-15

    Logged In: YES
    user_id=1110

    Ah, I hadn't noticed you use c-ares!

    I'll check your details closer tomorrow. Thanks for all the
    hard work!

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-16

    Logged In: YES
    user_id=1110

    The fact that you made it work by simply not re-using the
    connection (leaving the struct around) but not when you
    disconnect and kill the struct properly, certainly indicates
    that "something" wrongly is trying to use that memory at a
    later time.

    I suspect it might be the c-ares callback that gets called
    when the resolve worked or timed out, because it was somehow
    not properly cancelled in a previous request.

    Any chance you can temporarily remove your fix and add a
    debug output or a break-point in
    hostasync.c:addrinfo_callback() and check if this or a
    related issue is perhaps the case?

    Can you think of a way I could do repeat this problem in my end?

     
  • xbx

    xbx - 2006-02-16

    Logged In: YES
    user_id=1452265

    yes that's what I said, it does crash in
    addrinfo_callback(). With this callstack:

    #0 0xb7ed35a5 in Curl_hash_add (h=0x21, key=0x8a824cc
    "\001:-1209571840",
    key_len=14, p=0x8b159cc) at hash.c:153
    #1 0xb7eb0443 in Curl_cache_addr (data=0x8ae1c68,
    addr=0x8b1580c,
    hostname=0xb7e765d8 "\001", port=-1209571840) at
    hostip.c:328
    #2 0xb7ed72c9 in addrinfo_callback (arg=0x8ae1a2c,
    status=0, addr=0x8ae1fd8)
    at hostasyn.c:129
    #3 0xb7ed7342 in Curl_addrinfo4_callback (arg=0x8ae1a2c,
    status=0,
    hostent=0x8ae1fd8) at hostasyn.c:161
    #4 0xb7edaa73 in end_hquery (hquery=0x8b2c708,
    status=Variable "status" is not available.
    )
    at ares_gethostbyname.c:187
    #5 0xb7edaf27 in host_callback (arg=0x8b2c708, status=0,
    abuf=0xbff164f7 ":v\201\200", alen=128) at
    ares_gethostbyname.c:169
    #6 0xb7ede63c in qcallback (arg=0x8a6b630, status=Variable
    "status" is not available.
    ) at ares_query.c:111
    #7 0xb7ed9cad in end_query (channel=0x8af0778,
    query=0x8b22c20, status=Variable "status" is not available.
    )
    at ares_process.c:693
    #8 0xb7eda1bd in process_answer (channel=0x8af0778,
    abuf=0xbff164f7 ":v\201\200", alen=128, whichserver=1,
    tcp=0,
    now=1140100006) at ares_process.c:373
    #9 0xb7eda73f in ares_process (channel=0x8af0778,
    read_fds=0xbff167b8,
    write_fds=0xbff16738) at ares_process.c:287
    #10 0xb7ed6d0d in Curl_is_resolved (conn=0x8b27e6c,
    dns=0xbff16890)
    at hostares.c:146
    #11 0xb7ed40bd in curl_multi_perform (multi_handle=0x8ad8f4c,
    running_handles=0xbff16944) at multi.c:449
    [...]

    in order to reproduce it, I don't know if it's easy. (I'm
    avoiding to change anything/reboot my computer because I'm
    afraid to lose the situation..)
    but I think the dns querries are very slow because the first
    entry is 127.0.0.1 which is probaly running but not
    configured correctly, and then a real dns ip. So I guess it
    first try to ask 127.0.0.1 and after a timeout it gets the
    answer from the other dns...

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-17

    Logged In: YES
    user_id=1110

    The most interesting in that back trace is this:

    #2 0xb7ed72c9 in addrinfo_callback (arg=0x8ae1a2c,
    status=0, addr=0x8ae1fd8)

    Status zero means OK, so that was a successful resolve
    callback! The question is why does this callback get sent if
    the connection is no longer OK.

    Are you doing several simultanoues transfers with the multi
    interface? I can see a problem in the code involving c-ares,
    DNS timeouts and simultaneous transfers...

     
  • xbx

    xbx - 2006-02-17

    Logged In: YES
    user_id=1452265

    What I imagine is the following scenario:
    - a request is made, a connection (1) is created (but with
    no sockets yet), and a dns request is made via ares.
    - some time after that, an other similar request is made.
    curl reuses (1) because it seems to fit. But at that point
    it the socket is invalid, so it either
    crashes (the initial bug report),
    or, with the "fix" in SocketIsDead(), it notices it,
    closes the connection (1) and starts a new one (2).
    - after some more time, the first dns querry completes
    successfully, the callback is called, with the connection
    (1) as parameter which has been destructed -> crash with
    callstack from message 2006-02-16 15:38 .

    So, according to me the bug isn't in ares, but in curl: when
    it closes connections because SocketIsDead() (which, I think
    is a network failure recovery code; It shouldn't happen in
    normal cases), it should cancel ares request as well (so
    that the callback with destructed pointer isn't used).
    And it shouldn't even be in the case where it think it can
    reuse the connection, as it is still being worked on by a
    previous request. -> that's what I tired to fix (with
    success) by saying "connections with uninitialised sockets
    cannot be reused". but I don't know if it's the correct fix.

    (yes I'm doing simultatenous transfers with mutli.)

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-17

    Logged In: YES
    user_id=1110

    Yes, it looks like the first transfer's ares resolve remains
    "outstanding" when it stops, and the callback for it then
    gets called when a subsequent transfer is running.

    The only place in the code that stops outstanding ares
    resolves is when a DNS timeout is handled, so I guess the
    first transfer doesn't bail out because of this. Can you
    figure out why/how it completes and why there seems to be a
    outstanding resolve left pending?

    I don't think this is a c-ares bug. I believe it is a
    libcurl bug that is related to how libcurl uses c-ares.

    Also, the ares_cancel() function get gets called in the DNS
    timeout case (and only then) cancels _all_ outstanding ares
    resolves which just can't be the right thing to do.

     
  • xbx

    xbx - 2006-02-21

    Logged In: YES
    user_id=1452265

    Got a repro for you!

    What is probably happening is that my program gets tired of
    waiting for a reply and aborts the request, by doing a
    curl_multi_remove_handle().
    (from the manual: "Removing a handle while being used, will
    effectively halt all transfers in progress." )

    However, you get the crash from the initial bug report if
    you do this with latest cvs:

    int test(char *URL)
    {
    CURL *c;
    CURLM *m;
    CURLMcode res;
    int running=1;

    curl_global_init(CURL_GLOBAL_ALL);
    c = curl_easy_init();
    curl_easy_setopt(c, CURLOPT_URL, "http://www.google.com");
    m = curl_multi_init();

    res = curl_multi_add_handle(m, c);

    do {
    res = curl_multi_perform(m, &running);
    } while (res==CURLM_CALL_MULTI_PERFORM);

    curl_multi_remove_handle(m, c);
    curl_easy_setopt(c, CURLOPT_URL, "http://www.google.com");
    res = curl_multi_add_handle(m, c);

    while (running) {
    res = curl_multi_perform(m, &running);
    if (running <= 0) {
    fprintf(stderr, "nothing left running.\n");
    break;
    }
    }
    curl_multi_remove_handle(m, c);
    curl_easy_cleanup(c);
    curl_multi_cleanup(m);

    return res;
    }

    (obviously, my program doesn't do smething as dumb, but when
    then dns takes ages to respond, I guess this is equivalent.)

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-21

    Logged In: YES
    user_id=1110

    Splendid! I'll get working on this as soon as I can.

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-22

    Logged In: YES
    user_id=1110

    Your sample program segfaults repeatedly.

    I could make it stop and behave fine using the patch below,
    and I'd be interested to know if this fixes the "real"
    problem you've been seeing. It basically just an added call
    to Curl_done() when the handle is removed from the multi
    stack and the transfer hasn't completed:

    --- lib/multi.c 11 Feb 2006 22:35:17 -0000 1.69
    +++ lib/multi.c 22 Feb 2006 11:53:35 -0000
    @@ -245,6 +245,8 @@
    Curl_easy_addmulti(easy->easy_handle, NULL); /* clear
    the association
    to this
    multi handle */

    + Curl_done(&easy->easy_conn, easy->result);
    +
    /* make the previous node point to our next */
    if(easy->prev)
    easy->prev->next = easy->next;

     
  • xbx

    xbx - 2006-02-22

    Logged In: YES
    user_id=1452265

    I tried it.

    1) without the patch, it crashes as usual.

    2) with the line added, it crashes earlier in Curl_done
    because easy_conn == Null

    3) with this patch added to multi.c:
    if (easy->easy_conn)
    Curl_done(&easy->easy_conn, easy->result);
    it actually triggers when the programs decides to abort the
    request after a tieout.
    and it supresses the crash from the bug report.
    However, there still is the crash a bit later when the ares
    request completes and calls the callbak with destroyed
    pointers..

    with this stack:
    #0 0xb7f067a9 in addrinfo_callback (arg=0x8b19e3c,
    status=0, addr=0x8a89894) at hostasyn.c:126
    #1 0xb7f0685a in Curl_addrinfo4_callback (arg=0x8b19e3c,
    status=0, hostent=0x8a89894) at hostasyn.c:161
    #2 0xb7f0c97a in end_hquery (hquery=0x8a9f26c, status=0,
    host=0x8a89894) at ares_gethostbyname.c:187
    #3 0xb7f0c8eb in host_callback (arg=0x8a9f26c, status=0,
    abuf=0xbfb4504b "\221\233\201\200", alen=128)
    at ares_gethostbyname.c:169
    #4 0xb7f0a735 in qcallback (arg=0x8ae3444, status=0,
    abuf=0xbfb4504b "\221\233\201\200", alen=128)
    at ares_query.c:111
    #5 0xb7f0a3ba in end_query (channel=0x8a9caec,
    query=0x8b198c4, status=0,
    abuf=0xbfb4504b "\221\233\201\200", alen=128) at
    ares_process.c:696
    #6 0xb7f09b2b in process_answer (channel=0x8a9caec,
    abuf=0xbfb4504b "\221\233\201\200", alen=128,
    whichserver=1, tcp=0, now=1140620161) at ares_process.c:375
    #7 0xb7f0990c in read_udp_packets (channel=0x8a9caec,
    read_fds=0xbfb45338, now=1140620161)
    at ares_process.c:289
    #8 0xb7f09349 in ares_process (channel=0x8a9caec,
    read_fds=0xbfb45338, write_fds=0xbfb452b8)
    at ares_process.c:93
    #9 0xb7f06225 in Curl_is_resolved (conn=0x8b306ec,
    dns=0xbfb45410) at hostares.c:146
    #10 0xb7f03594 in curl_multi_perform
    (multi_handle=0x89c8e2c, running_handles=0xbfb454c4) at
    multi.c:453
    [...]

    arg=0x8b19e3c being an invalid pointer.

    looks like Curl_done doesn't abort pending ares requests..

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-22

    Logged In: YES
    user_id=1110

    Right. I guess it just didn't trigger for me since it
    resolved the name "too fast". What about this additional
    patch. It takes a little extra #ifdef magic to be really
    nice for non-ares cases but I first need to know if this
    cures your problem!

    diff -u -r1.492 url.c
    --- lib/url.c 21 Feb 2006 07:46:41 -0000 1.492
    +++ lib/url.c 22 Feb 2006 17:28:58 -0000
    @@ -3999,6 +3999,10 @@

    Curl_pgrsDone(conn); /* done with the operation */

    + /* for ares-using, make sure all possible outstanding
    requests are properly
    + cancelled before we proceed */
    + ares_cancel(data->state.areschannel);
    +
    /* if data->set.reuse_forbid is TRUE, it means the
    libcurl client has
    forced us to close this no matter what we think.

     
  • xbx

    xbx - 2006-02-22

    Logged In: YES
    user_id=1452265

    seems to cure my crash...
    great!

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-22
    • status: open --> open-fixed
     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-22

    Logged In: YES
    user_id=1110

    Lovely!

    If you tell me your name, I'll make sure you give you proper
    credits for your hard work in the changelog.

     
  • xbx

    xbx - 2006-02-23

    Logged In: YES
    user_id=1452265

    No!
    thank *you* for the hard work and the lib!

     
  • Daniel Stenberg

    Daniel Stenberg - 2006-02-23
    • status: open-fixed --> closed-fixed
     
  • xbx

    xbx - 2006-02-23

    Logged In: YES
    user_id=1452265

    just did "cvs update", the fix is in! but...
    1) in multi.c:249, you'd need
    if (easy->easy_conn)
    Curl_done(&easy->easy_conn, easy->result);
    instead of
    Curl_done(&easy->easy_conn, easy->result);
    (otherwise it crashes for me, for instance when a request
    completes successfully)

    2) in url.c
    didn't you say you needed to warp
    ares_cancel(data->state.areschannel);
    around #ifdef USE_ARES / #endif ?

     
1 2 > >> (Page 1 of 2)