Menu

#792 dns cache memory leak and TTL failure after failed conn

closed-fixed
libcurl (356)
5
2014-12-18
2008-12-10
No

After a failed connection attempt, the DNS cache entry appears never to be unlocked. This results in a memory leak and an ever-increasing inuse count. And, since the cache entry can only ever be pruned when both the TTL (DNS_CACHE_TIMEOUT) is expired and inuse=0, the hostname will never be reresolved for the lifetime of that DNS cache.

I have attached a simple program that demonstrates the problem. It takes a URL to download on the command line. Make sure the hostname resolves to an IP that just times out. If you change the hostname's resolution, libcurl continues trying to connect to the original IP forever (as observed by netstat, for example). Also, adding debug print to url.c shows the increasing "inuse" count.

My patch to solve this issue is included below, though you may prefer to solve the problem a different way. (I have only lightly tested the patch at this point.)

--- curl-7.19.2.orig/lib/url.c Tue Dec 9 23:10:34 2008
+++ curl-7.19.2/lib/url.c Wed Dec 10 03:16:48 2008
@@ -2699,6 +2699,7 @@ ConnectionStore(struct SessionHandle *da
static CURLcode ConnectPlease(struct SessionHandle *data,
struct connectdata *conn,
struct Curl_dns_entry *hostaddr,
+ bool *hostaddr_used,
bool *connected)
{
CURLcode result;
@@ -2721,6 +2722,12 @@ static CURLcode ConnectPlease(struct Ses
connected);
if(CURLE_OK == result) {
/* All is cool, then we store the current information */
+ if (conn->dns_entry != hostaddr) {
+ if (conn->dns_entry && !*hostaddr_used)
+ Curl_resolv_unlock(data, conn->dns_entry);
+ if (hostaddr)
+ *hostaddr_used = TRUE;
+ }
conn->dns_entry = hostaddr;
conn->ip_addr = addr;

@@ -4354,6 +4361,7 @@ static CURLcode create_conn(struct Sessi

static CURLcode setup_conn(struct connectdata *conn,
struct Curl_dns_entry *hostaddr,
+ bool *hostaddr_used,
bool *protocol_done)
{
CURLcode result=CURLE_OK;
@@ -4406,7 +4414,8 @@ static CURLcode setup_conn(struct connec
* only for the case where we re-use an existing connection and thus
* this code section will not be reached with hostaddr == NULL.
*/
- result = ConnectPlease(data, conn, hostaddr, &connected);
+ result = ConnectPlease(data, conn, hostaddr, hostaddr_used, &connected);
+ free_hostaddr = (result != CURLE_OK);

if(connected) {
result = Curl_protocol_connect(conn, protocol_done);
@@ -4468,6 +4477,7 @@ CURLcode Curl_connect(struct SessionHand
{
CURLcode code;
struct Curl_dns_entry *dns;
+ bool dns_used = FALSE;

*asyncp = FALSE; /* assume synchronous resolves by default */

@@ -4485,12 +4495,16 @@ CURLcode Curl_connect(struct SessionHand
/* If an address is available it means that we already have the name
resolved, OR it isn't async. if this is a re-used connection 'dns'
will be NULL here. Continue connecting from here */
- code = setup_conn(*in_connect, dns, protocol_done);
+ code = setup_conn(*in_connect, dns, &dns_used, protocol_done);
/* else
response will be received and treated async wise */
}
}

+ /* Free dns entry if it was not used */
+ if (dns && !dns_used)
+ Curl_resolv_unlock(data, dns);
+
if(CURLE_OK != code && *in_connect) {
/* We're not allowed to return failure with memory left allocated
in the connectdata struct, free those here */
@@ -4511,7 +4525,11 @@ CURLcode Curl_async_resolved(struct conn
{
#if defined(USE_ARES) || defined(USE_THREADING_GETHOSTBYNAME) || \ defined(USE_THREADING_GETADDRINFO)
- CURLcode code = setup_conn(conn, conn->async.dns, protocol_done);
+ bool dns_used = FALSE;
+ CURLcode code = setup_conn(conn, conn->async.dns, &dns_used, protocol_done);
+
+ if (conn->async.dns && !dns_used)
+ Curl_resolv_unlock(data, conn->async.dns);

if(code)
/* We're not allowed to return failure with memory left allocated

Discussion

  • Phil Lisiecki

    Phil Lisiecki - 2008-12-10

    Simple example program

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-10

    Revised patch (one line from a previous solution attempt slipped into the previous patch accidentally):

    --- curl-7.19.2.orig/lib/url.c Tue Dec 9 23:10:34 2008
    +++ curl-7.19.2/lib/url.c Wed Dec 10 03:16:48 2008
    @@ -2699,6 +2699,7 @@ ConnectionStore(struct SessionHandle *da
    static CURLcode ConnectPlease(struct SessionHandle *data,
    struct connectdata *conn,
    struct Curl_dns_entry *hostaddr,
    + bool *hostaddr_used,
    bool *connected)
    {
    CURLcode result;
    @@ -2721,6 +2722,12 @@ static CURLcode ConnectPlease(struct Ses
    connected);
    if(CURLE_OK == result) {
    /* All is cool, then we store the current information */
    + if (conn->dns_entry != hostaddr) {
    + if (conn->dns_entry && !*hostaddr_used)
    + Curl_resolv_unlock(data, conn->dns_entry);
    + if (hostaddr)
    + *hostaddr_used = TRUE;
    + }
    conn->dns_entry = hostaddr;
    conn->ip_addr = addr;

    @@ -4354,6 +4361,7 @@ static CURLcode create_conn(struct Sessi

    static CURLcode setup_conn(struct connectdata *conn,
    struct Curl_dns_entry *hostaddr,
    + bool *hostaddr_used,
    bool *protocol_done)
    {
    CURLcode result=CURLE_OK;
    @@ -4406,7 +4414,7 @@ static CURLcode setup_conn(struct connec
    * only for the case where we re-use an existing connection and thus
    * this code section will not be reached with hostaddr == NULL.
    */
    - result = ConnectPlease(data, conn, hostaddr, &connected);
    + result = ConnectPlease(data, conn, hostaddr, hostaddr_used, &connected);

    if(connected) {
    result = Curl_protocol_connect(conn, protocol_done);
    @@ -4468,6 +4476,7 @@ CURLcode Curl_connect(struct SessionHand
    {
    CURLcode code;
    struct Curl_dns_entry *dns;
    + bool dns_used = FALSE;

    *asyncp = FALSE; /* assume synchronous resolves by default */

    @@ -4485,12 +4494,16 @@ CURLcode Curl_connect(struct SessionHand
    /* If an address is available it means that we already have the name
    resolved, OR it isn't async. if this is a re-used connection 'dns'
    will be NULL here. Continue connecting from here */
    - code = setup_conn(*in_connect, dns, protocol_done);
    + code = setup_conn(*in_connect, dns, &dns_used, protocol_done);
    /* else
    response will be received and treated async wise */
    }
    }

    + /* Free dns entry if it was not used */
    + if (dns && !dns_used)
    + Curl_resolv_unlock(data, dns);
    +
    if(CURLE_OK != code && *in_connect) {
    /* We're not allowed to return failure with memory left allocated
    in the connectdata struct, free those here */
    @@ -4511,7 +4524,11 @@ CURLcode Curl_async_resolved(struct conn
    {
    #if defined(USE_ARES) || defined(USE_THREADING_GETHOSTBYNAME) || \ defined(USE_THREADING_GETADDRINFO)
    - CURLcode code = setup_conn(conn, conn->async.dns, protocol_done);
    + bool dns_used = FALSE;
    + CURLcode code = setup_conn(conn, conn->async.dns, &dns_used, protocol_done);
    +
    + if (conn->async.dns && !dns_used)
    + Curl_resolv_unlock(data, conn->async.dns);

    if(code)
    /* We're not allowed to return failure with memory left allocated

     
  • Daniel Stenberg

    Daniel Stenberg - 2008-12-10

    Thanks for the report and patch.

    Can you please submit it as an attachment to this report, as it gets manged by this comment system when inlined and thus is somewhat hard to extract and apply...

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-11

    Proposed patch

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-11

    Posting patch as an attachment as requested
    File Added: patch

     
  • Daniel Stenberg

    Daniel Stenberg - 2008-12-12

    my edited (pure C) version that shows no leak to me

     
  • Daniel Stenberg

    Daniel Stenberg - 2008-12-12

    I just uploaded my version of your test code "2413067.c" which is a C version that exits and it does not show any leaks to me. Not with libcurls own leak tracker nor with valgrind. Using the current libcurl in CVS built on linux with c-ares.

    How do you "prove" the leak? What resolver do you use? What operating system do you use? Do you see a leak with my version of the test too?

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-13

    I just uploaded my version of your test code "2413067.c" which is a C
    version that exits and it does not show any leaks to me. Not with libcurls
    own leak tracker nor with valgrind. Using the current libcurl in CVS built
    on linux with c-ares.

    How do you "prove" the leak? What resolver do you use? What operating
    system do you use? Do you see a leak with my version of the test too?

    First the simple stuff. I am using a bind9 under a linux-2.4-based
    kernel, gcc-3.3. "dig" confirms that the DNS entries are updating
    successfully on my client machine. I do see the TTL problem with your
    example, but the "leak" is hard to see with just one hostname...

    I should have been more specific with my definition of "leak." All of
    the allocated memory is associated with the DNS cache, which in turn
    is associated with the easy handle. So, curl_easy_cleanup() will find
    all of the memory that was allocated and free it appropriately. But,
    a long-running process which uses only a single easy handle will use
    an amount of memory proportional to the number of host names it
    resolves to broken IPs. As I understand the intended memory usage,
    only cache entries used in the last DNS_CACHE_TIMEOUT should be kept
    around.

    The original problem which I observed in my application that led me to
    investigate the problem was actually the broken TTL issue. My
    application repeatedly makes requests against some particular
    hostname. This host died, so I updated the DNS record to point to a
    new IP, but my application continued connecting to the original host
    forever (until I restarted the daemon by hand).

    Reproducing the TTL problem:

    I augmented your C example to print the PRIMARY_IP after each trial
    and to sleep 20 seconds between trials. (2413067_ttl.c)

    Then, I set up a domain name that points to a nonexistent host with a
    30-second TTL and started the test program and saw that it failed to
    connect to the bad IP. Then, I change the domain name and confirmed
    that the machine's local named (/etc/resolv.conf points to 127.0.0.1)
    had picked up the update within 30 seconds. But, the example program
    continued to print the wrong IP even after another 60 seconds had
    elapsed. "netstat" also shows "SYN_SENT" to the old IP as well, so
    this isn't just a bug with the PRIMARY_IP interface.

    Also, if you attach gdb or add debug print, you will see that the
    "inuse" counter on the host entry continues to increase.

    Reproducing the "leak":

    To observe the actual "leak," I have made another slightly modified
    version of your test program (2413067_leak.c). This program loops
    over ten domain names (by inserting the numbers 1-10 between argv[1]
    and argv[2]), with a 20-second sleep between hosts. If all 10
    hostnames resolve to nonexistent hosts, you would expect the cache
    size to grow to about 3-4 entries (I am using a post on the same
    subnet such that connect times out after 3 seconds; off subnet, the
    timeout will be much longer, such that you might expect 2 or fewer
    entries). To show the number of entries, I added a printf to
    Curl_hostcache_prune (patch 2413067_leak_debug_print_patch). Instead,
    the cache grows all the way to 10 entries. You could extend this to
    an arbitrary number of entries if you want...

    Here are the post-prune cache sizes I saw:

    post-prune cache size: 1
    post-prune cache size: 2
    post-prune cache size: 3
    post-prune cache size: 4
    post-prune cache size: 5
    post-prune cache size: 6
    post-prune cache size: 7
    post-prune cache size: 8
    post-prune cache size: 9
    post-prune cache size: 10

    Note that the expected cache behavior happens with a successful
    connection. I used this "web server" to answer curl's requests:

    while true; do (echo 'HTTP/1.0 200 OK'; echo 'Connection: close'; echo 'Content-Length: 0'; echo) | nc -l -p 80; echo hi; done

    post-prune cache size: 1
    post-prune cache size: 2
    post-prune cache size: 3
    post-prune cache size: 3
    post-prune cache size: 3
    post-prune cache size: 3
    post-prune cache size: 3
    post-prune cache size: 3
    post-prune cache size: 3
    post-prune cache size: 3

    File Added: 2413067_ttl.c

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-13

    TTL problem example

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-13

    Another attachment...
    File Added: 2413067_leak.c

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-13

    "leak" example program

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-13

    "leak" problem debugging patch

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-13

    Another attachment...
    File Added: 2413067_leak_debug_print_patch

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-13

    A couple small corrections/clarifications to my previous comment:

    > This isn't just a bug with the PRIMARY_IP interface.

    I did not mean to imply that there might be any sort of bug with the PRIMARY_IP interface, but rather to state that the behavior was confirmed to be real another way as well.

    > I am using a post on the same subnet...

    I meant, "I am using a _host_ on the same subnet" (though that was probably obvious).

     
  • Dan Fandrich

    Dan Fandrich - 2008-12-13

    What version of glibc are you using? It appears that at least glibc 2.2.5 has a memory leak somewhere, as evidenced by 4 failing tests on one of the Red Hat 7.2 autobuilds at http://curl.haxx.se/auto/ Keep that in mind as you investigate the cause of this.

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-13

    I am indeed using 2.2.5, but I'm positive this is not a libc issue, since the memory _is_ associated with the curl DNS cache (which is why curl_easy_cleanup can free the memory; also see the "post-prune cache size" debug print). I have also observed the "leak" directly in curl's dns cache entry "inuse" counter. The "inuse" counter should never exceed "1" with a non-shared cache, but it keeps counting up!

    The TTL symptom is also definitely not glibc (e.g., caching internally instead of re-resolving) since my patch solves the problem.

     
  • Daniel Stenberg

    Daniel Stenberg - 2008-12-13

    Running with latest CVS libcurl with your 2413067_leak_debug_print_patch applied, testing your 2413067_leak.c code, I _don't_ get any increased "post-prune cache size" when I run it with 10 host names that don't exist. The output just keeps repeating:

    $ ./debugit foo ""

    (this piece was what we saw for each host)
    url: foo6
    post-prune cache size: 0
    6 returned from curl_easy_perform
    IP was

    I tried with both my regular c-ares build and a non-c-ares one (getaddrinfo) and they showed the same output.

    I then tried by adding 'moo1' to 'moo5' in my /etc/hosts and ran it as "./debugit moo ""' but that too ended up with "post-prune cache size: 0" in the output...

    I'll now skip right to reading the patch to see if I can understand this more!

     
  • Phil Lisiecki

    Phil Lisiecki - 2008-12-14

    I think the problem should not occur if the DNS resolution fails
    (NXDOMAIN, etc.). The DNS resolution needs to succeed, but then the
    connect attempt must fail. I originally found the problem when the
    target IP didn't answer at all. I have more recently confirmed that
    the same problem shows up when the connection is refused, e.g., by
    specifying a port number where there is no server listening.

    I have not tried the CVS head version. If you think there may be
    something fixed since 7.19.2, I can also try the latest code on
    Monday...

     
  • Daniel Stenberg

    Daniel Stenberg - 2008-12-15

    Ah yes. If I let the resolves work but the connects fail I can repeat the problem. Thanks.

     
  • Daniel Stenberg

    Daniel Stenberg - 2008-12-16

    Thanks for your work, with your patch I too can see that the cache is no longer growing. However I'm not entirely satisfied with the patch and here are my questions/nits on it:

    General comment: I lack added comments in the code about how this new code works and why.

    setup_conn() - why does it need the hostaddr_used argument, can't we just make sure to "unlock" the resource within the function (or even preferably within the ConnectPlease function) rather than to return that state to let the parent do the job? And if it has to do it, why does it set hostaddr_used on its own?

    ConnectPlease() - I don't quite understand the logic around when then dns entry is unlocked. Can you clarify the reasoning with a comment?

    free_hostaddr - within the setup_conn seems to be the start of something like that but right now it is just an assignment of a variable that isn't used?

    Curl_async_resolved() didn't build with the patch so I had to modify it manually, but perhaps more importantly: why is the unlocking of the dns entry dependent on the "conn->async.dns" being true and not just dns_used being false?

     
  • Daniel Stenberg

    Daniel Stenberg - 2008-12-26

    Please, even though I can repeat the problem I can neither accept nor apply this patch without some further work!

     
  • Daniel Stenberg

    Daniel Stenberg - 2008-12-29

    Thanks for the report, this problem is now fixed in CVS!

     
  • Daniel Stenberg

    Daniel Stenberg - 2008-12-29
    • status: open --> closed-fixed