Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfaults with http2 on libcurl 7.60.0 #2674

Closed
jeroen opened this issue Jun 21, 2018 · 45 comments
Closed

Segfaults with http2 on libcurl 7.60.0 #2674

jeroen opened this issue Jun 21, 2018 · 45 comments

Comments

@jeroen
Copy link
Contributor

jeroen commented Jun 21, 2018

As of recently my programs started to segfault when Debian switch to libcurl with openssl+nghttp2. I think there are may be multiple issues.

Here is a simple program (link) that does the following:

 create one multi-handle
 repeat 20x {
    - add 100 http/2 requests to the multi-handle
    - run multi handle until all requests are done
    - remove requests from the multi-handle
  }

The program works on my macos with native libcurl (curl 7.54.0 (x86_64-apple-darwin17.0) libcurl/7.54.0 LibreSSL/2.0.20 zlib/1.2.11 nghttp2/1.24.0)

However on Debian (buster) the program segfaults after a while, both when compiled with libcurl4-openssl-dev as well as libcurl4-gnutls-dev. Both versions are libcurl 7.60.0.

I can also reproduce the problem on MacOS if I build libcurl with nghttp2 support:

brew install curl --with-nghttp2
brew link --force curl

And now the program will segfault here as well:

clang test.c $(pkg-config --cflags --libs libcurl)
./a.out

# Round 0: adding 100 handles...
# Completed: 100 OK!
# Round 1: adding 100 handles...
# Completed: 100 OK!
# Round 2: adding 100 handles...
# Segmentation fault: 11

And if I then brew unlink curl and recompile, everything is fine again.

@jeroen
Copy link
Contributor Author

jeroen commented Jun 21, 2018

Here is a backtrace from the above program on macos:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x228)
  * frame #0: 0x00000001000dc0f4 libcurl.4.dylib`h2_session_send + 79
    frame #1: 0x00000001000db019 libcurl.4.dylib`h2_process_pending_input + 147
    frame #2: 0x00000001000dc4f3 libcurl.4.dylib`http2_conncheck + 202
    frame #3: 0x00000001000b9211 libcurl.4.dylib`extract_if_dead + 81
    frame #4: 0x00000001000b7b62 libcurl.4.dylib`create_conn + 11961
    frame #5: 0x00000001000b4c35 libcurl.4.dylib`Curl_connect + 30
    frame #6: 0x00000001000c75de libcurl.4.dylib`multi_runsingle + 987
    frame #7: 0x00000001000c7110 libcurl.4.dylib`curl_multi_perform + 133
    frame #8: 0x0000000100000c92 a.out`main at test.c:57
    frame #9: 0x00007fff59734015 libdyld.dylib`start + 1

Also sometimes it crashes here:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x00000001000c9397 libcurl.4.dylib`Curl_set_in_callback + 28
    frame #1: 0x00000001000ae9b4 libcurl.4.dylib`showit + 46
    frame #2: 0x00000001000adf2b libcurl.4.dylib`Curl_infof + 227
    frame #3: 0x00000001000dc4ce libcurl.4.dylib`http2_conncheck + 165
    frame #4: 0x00000001000b9211 libcurl.4.dylib`extract_if_dead + 81
    frame #5: 0x00000001000b7b62 libcurl.4.dylib`create_conn + 11961
    frame #6: 0x00000001000b4c35 libcurl.4.dylib`Curl_connect + 30
    frame #7: 0x00000001000c75de libcurl.4.dylib`multi_runsingle + 987
    frame #8: 0x00000001000c7110 libcurl.4.dylib`curl_multi_perform + 133
    frame #9: 0x0000000100000c92 a.out`main at test.c:57
    frame #10: 0x00007fff59734015 libdyld.dylib`start + 1

@bagder
Copy link
Member

bagder commented Jun 21, 2018

This commit might be the fix you want: 2c15693. It fixed basically the same symptoms.

@jeroen
Copy link
Contributor Author

jeroen commented Jun 21, 2018

I tried rebuilding against the master branch. That fixes the example but my R program still crashes. I haven't been able to translate it to a C example yet but here is the backtrace:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xb8)
  * frame #0: 0x0000000106937ebc libcurl.4.dylib`Curl_conncache_bundle_size + 12
    frame #1: 0x0000000106917938 libcurl.4.dylib`create_conn + 13880
    frame #2: 0x000000010691428c libcurl.4.dylib`Curl_connect + 30
    frame #3: 0x00000001069264f6 libcurl.4.dylib`multi_runsingle + 987
    frame #4: 0x0000000106926028 libcurl.4.dylib`curl_multi_perform + 133
    frame #5: 0x00000001039dfdfa curl.so`R_multi_run + 1530
    frame #6: 0x0000000100123543 libR.dylib`R_doDotCall(ofun=<unavailable>, nargs=<unavailable>, cargs=<unavailable>, call=0x0000000106146d98) at dotcode.c:576 [opt]
    frame #7: 0x000000010016352b libR.dylib`bcEval(body=<unavailable>, rho=0x00000001060c0318, useCache=<unavailable>) at eval.c:7279 [opt]
    frame #8: 0x000000010015386b libR.dylib`Rf_eval(e=<unavailable>, rho=<unavailable>) at eval.c:624 [opt]
    frame #9: 0x00000001001661bc libR.dylib`R_execClosure(call=0x0000000106142200, newrho=0x00000001060c0318, sysparent=<unavailable>, rho=<unavailable>, arglist=<unavailable>, op=<unavailable>) at eval.c:0 [opt]
    frame #10: 0x0000000100158883 libR.dylib`bcEval(body=<unavailable>, rho=0x0000000101062f98, useCache=<unavailable>) at eval.c:6739 [opt]
    frame #11: 0x000000010015386b libR.dylib`Rf_eval(e=<unavailable>, rho=<unavailable>) at eval.c:624 [opt]
    frame #12: 0x00000001001645a3 libR.dylib`forcePromise(e=0x00000001060c0660) at eval.c:520 [opt]
    frame #13: 0x00000001001538f2 libR.dylib`Rf_eval(e=0x00000001060c0660, rho=<unavailable>) at eval.c:660 [opt]
    frame #14: 0x00000001001a873a libR.dylib`do_usemethod(call=0x000000010612e708, op=<unavailable>, args=<unavailable>, env=0x00000001060c0580) at objects.c:522 [opt]
    frame #15: 0x0000000100158e39 libR.dylib`bcEval(body=<unavailable>, rho=0x00000001060c0580, useCache=<unavailable>) at eval.c:6791 [opt]
    frame #16: 0x000000010015386b libR.dylib`Rf_eval(e=<unavailable>, rho=<unavailable>) at eval.c:624 [opt]
    frame #17: 0x00000001001661bc libR.dylib`R_execClosure(call=0x000000010612e7e8, newrho=0x00000001060c0580, sysparent=<unavailable>, rho=<unavailable>, arglist=<unavailable>, op=<unavailable>) at eval.c:0 [opt]
    frame #18: 0x0000000100158883 libR.dylib`bcEval(body=<unavailable>, rho=0x0000000101062f98, useCache=<unavailable>) at eval.c:6739 [opt]
    frame #19: 0x000000010016801c libR.dylib`R_compileAndExecute(call=<unavailable>, rho=<unavailable>) at eval.c:1406 [opt]
    frame #20: 0x00000001001680b9 libR.dylib`do_while(call=0x00000001059ff248, op=0x000000010102cb80, args=0x00000001059ff280, rho=0x0000000101062f98) at eval.c:2308 [opt]
    frame #21: 0x0000000100153a5e libR.dylib`Rf_eval(e=<unavailable>, rho=0x0000000101062f98) at eval.c:699 [opt]
    frame #22: 0x0000000100196efa libR.dylib`Rf_ReplIteration(rho=0x0000000101062f98, savestack=<unavailable>, browselevel=0, state=0x00007ffeefbfe7d0) at main.c:258 [opt]
    frame #23: 0x00000001001984df libR.dylib`run_Rmainloop [inlined] R_ReplConsole(rho=<unavailable>, savestack=0, browselevel=0) at main.c:308 [opt]
    frame #24: 0x0000000100198476 libR.dylib`run_Rmainloop at main.c:1082 [opt]
    frame #25: 0x0000000100000f5b R`main + 27
    frame #26: 0x00007fff59734015 libdyld.dylib`start + 1
    frame #27: 0x00007fff59734015 libdyld.dylib`start + 1

@bagder
Copy link
Member

bagder commented Jun 21, 2018

Try this additional patch, which really is a follow-up to that commit I mentioned above:

diff --git a/lib/url.c b/lib/url.c
index 0cab0a303..f1a6e8db4 100644
--- a/lib/url.c
+++ b/lib/url.c
@@ -1463,10 +1463,11 @@ ConnectionExists(struct Curl_easy *data,
   }
 
   if(chosen) {
     /* mark it as used before releasing the lock */
     chosen->inuse = TRUE;
+    chosen->data = data; /* own it! */
     Curl_conncache_unlock(needle);
     *usethis = chosen;
     return TRUE; /* yes, we found one to use! */
   }
   Curl_conncache_unlock(needle);

@jeroen
Copy link
Contributor Author

jeroen commented Jun 21, 2018

Yes this seems to resolve my immediate crash, so that is an improvement :-) I would need to investigate if this also resolves my other issues.

@bagder
Copy link
Member

bagder commented Jun 21, 2018

Great! I'll hold off merging this for a while to see if you find something else as well...

@jeroen
Copy link
Contributor Author

jeroen commented Jun 21, 2018

Everything looks good on MacOS now. The crashes were more frequent on Debian, but I don't have time to build a version of libcurl with nghttp2 from source on Linux. Perhaps once this fix is in master I can ask the Debian maintainer of libcurl to give it a try.

@jeroen
Copy link
Contributor Author

jeroen commented Jun 22, 2018

I'm still getting an occosional crash. Might be another problem. Here is a backtrace:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000105f9494f libcurl.4.dylib`Curl_http_readwrite_headers + 1862
    frame #1: 0x0000000105fa6f51 libcurl.4.dylib`Curl_readwrite + 931
    frame #2: 0x0000000105faec71 libcurl.4.dylib`multi_runsingle + 2902
    frame #3: 0x0000000105fae028 libcurl.4.dylib`curl_multi_perform + 133
    frame #4: 0x00000001042e02ea curl.so`R_multi_run + 1530
    frame #5: 0x0000000100123543 libR.dylib`R_doDotCall(ofun=<unavailable>, nargs=<unavailable>, cargs=<unavailable>, call=0x00000001064ec950) at dotcode.c:576 [opt]
    frame #6: 0x000000010016352b libR.dylib`bcEval(body=<unavailable>, rho=0x00000001064e3408, useCache=<unavailable>) at eval.c:7279 [opt]
    frame #7: 0x000000010015386b libR.dylib`Rf_eval(e=<unavailable>, rho=<unavailable>) at eval.c:624 [opt]
    frame #8: 0x00000001001661bc libR.dylib`R_execClosure(call=0x00000001064fd4a8, newrho=0x00000001064e3408, sysparent=<unavailable>, rho=<unavailable>, arglist=<unavailable>, op=<unavailable>) at eval.c:0 [opt]

@bagder
Copy link
Member

bagder commented Jun 22, 2018

so which pointer is NULL there? http.c:1862 doesn't seem to be the same line in my code...

@jeroen
Copy link
Contributor Author

jeroen commented Jun 22, 2018

I think that is a byte offset, not a line number. Itried building libcurl with --enable-debug --enable-curldebug --disable-hidden-symbols but I'm not getting a better back trace. Here is some output other debug output at least:

stream: adjusting kept idle streams num_idle_streams=0, max=100
recv: connection recv_window_size=580188, local_window=1073741824
stream: adjusting kept idle streams num_idle_streams=0, max=100
recv: [IB_READ_DATA]
recv: readlen=1179, payloadleft=2841
recv: data_readlen=1179
stream: adjusting kept idle streams num_idle_streams=0, max=100
recv: connection recv_window_size=581367, local_window=1073741824
stream: adjusting kept idle streams num_idle_streams=0, max=100
recv: [IB_READ_DATA]
recv: readlen=1179, payloadleft=1662
recv: data_readlen=1179
stream: adjusting kept idle streams num_idle_streams=0, max=100
Process 655 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010788da55 libcurl.4.dylib`Curl_http_readwrite_headers + 1886
libcurl.4.dylib`Curl_http_readwrite_headers:
->  0x10788da55 <+1886>: movl   $0x2, (%rax)
    0x10788da5b <+1892>: movq   -0x48(%rbp), %rax
    0x10788da5f <+1896>: movl   (%rax), %eax
    0x10788da61 <+1898>: cmpl   $0x130, %eax              ; imm = 0x130
Target 0: (R) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000010788da55 libcurl.4.dylib`Curl_http_readwrite_headers + 1886
    frame #1: 0x00000001078a0e3c libcurl.4.dylib`Curl_readwrite + 918
    frame #2: 0x00000001078a9c41 libcurl.4.dylib`multi_runsingle + 2623
    frame #3: 0x00000001078a910f libcurl.4.dylib`curl_multi_perform + 133
    frame #4: 0x00000001047d52ea curl.so`R_multi_run + 1530

@dtzWill
Copy link
Contributor

dtzWill commented Jun 22, 2018

I can reproduce (using multi-stress.cpp, Linux) using 7.60. Failures I'm seeing appear similar to the ones discussed in #2669 and recently fixed on master branch.

When building against curl 7.60 + first fix on that PR => can't seem to reproduce
When building against curl 'master' => can't seem to reproduce

Testing using address-sanitizer and debug configure flags.

If you're able to reproduce on Linux using patched curl (or latest git), please let me know --and any suggestions of things to "encourage" seeing this bug would help reproduce--which is important for ability to produce a fix.

@jeroen
Copy link
Contributor Author

jeroen commented Jun 22, 2018

@dtzWill Perhaps this is another unrelated problem. It happens randomly in a very complex web crawler application, so it's not trivial to produce a minimal C example.

The good news is I was finally able to create a proper backtrace:

Process 78108 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000106a269b7 libcurl.4.dylib`Curl_http_readwrite_headers(data=0x0000000107119808, conn=0x00000001070e6e08, nread=0x00007ffeefbf9268, stop_reading=0x00007ffeefbf9227) at http.c:3568
   3565
   3566	          /* HTTP/2 cannot blacklist multiplexing since it is a core
   3567	             functionality of the protocol */
-> 3568	          conn->bundle->multiuse = BUNDLE_MULTIPLEX;
   3569	        }
   3570	        else if(conn->httpversion >= 11 &&
   3571	                !conn->bits.close) {
Target 0: (R) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000106a269b7 libcurl.4.dylib`Curl_http_readwrite_headers(data=0x0000000107119808, conn=0x00000001070e6e08, nread=0x00007ffeefbf9268, stop_reading=0x00007ffeefbf9227) at http.c:3568
    frame #1: 0x0000000106a49d7a libcurl.4.dylib`readwrite_data(data=0x0000000107119808, conn=0x00000001070e6e08, k=0x00000001071198c8, didwhat=0x00007ffeefbf9308, done=0x00007ffeefbf94bb, comeback=0x00007ffeefbf9446) at transfer.c:539
    frame #2: 0x0000000106a49626 libcurl.4.dylib`Curl_readwrite(conn=0x00000001070e6e08, data=0x0000000107119808, done=0x00007ffeefbf94bb, comeback=0x00007ffeefbf9446) at transfer.c:1125
    frame #3: 0x0000000106a5a8d2 libcurl.4.dylib`multi_runsingle(multi=0x000000010093a338, now=(tv_sec = 6713, tv_usec = 389430), data=0x0000000107119808) at multi.c:1909
    frame #4: 0x0000000106a59310 libcurl.4.dylib`curl_multi_perform(multi=0x000000010093a338, running_handles=0x00007ffeefbf9704) at multi.c:2176
    frame #5: 0x00000001042e22ea curl.so`R_multi_run(pool_ptr=0x0000000101d2a4f8, timeout=0x0000000101700320, max=0x0000000107959a70) at multi.c:203

@dtzWill
Copy link
Contributor

dtzWill commented Jun 22, 2018

@dtzWill Perhaps this is another unrelated problem. It happens randomly in a very complex web crawler application, so it's not trivial to produce a minimal C example.

Ahh, okay. Great, would have been difficult to proceed if the same test crashed for you but not me (or perhaps others).

The good news is I was finally able to create a proper backtrace:

Hooray! Hopefully this helps identify the problem :).

@bagder
Copy link
Member

bagder commented Jun 23, 2018

It looks as if conn->bundle is NULL there, when it really shouldn't...

@bagder
Copy link
Member

bagder commented Jun 23, 2018

I don't seem to be able to reproduce this on git master on my two primary linux dev machines. I've run the multi-stress.c example code numerous times on both.

@dtzWill
Copy link
Contributor

dtzWill commented Jun 23, 2018

I don't seem to be able to reproduce this on git master on my two primary linux dev machines. I've run the multi-stress.c example code numerous times on both.

I don't think the multi-stress.cpp example triggers this bug for anyone, if I understand the meaning of OP's comment:

@dtzWill Perhaps this is another unrelated problem. It happens randomly in a very complex web crawler application, so it's not trivial to produce a minimal C example.

@jeroen
Copy link
Contributor Author

jeroen commented Jun 23, 2018

Indeed, the original problem was solved your earlier fix. This is another crash that remains. It happens in my web crawler application after visiting a few hundred pages. I haven't been able to extract a small C example yet.

@bagder
Copy link
Member

bagder commented Jun 23, 2018

Ah, I actually have to read what you write as well! =)

@jeroen
Copy link
Contributor Author

jeroen commented Jun 23, 2018

The application is written in R. It uses one multi-handle, and the user gives it a starting url.

Then it repeatedly downloads the html file, extracts hyperlinks, and randomly visits a few of those links. If those links return text/html content, it recursively retrieves those, extracts links, etc etc.

Each request is done by a new fresh easy-handle, but they all share the same multi-handle. So you end up with a multi handle that has hundreds of connections to random servers.

The application still works fine with my OSX stock libcurl, but not with the 7.60+ versions.

@jeroen
Copy link
Contributor Author

jeroen commented Jul 3, 2018

I have created a C example now! Please see the example from my web crawler PR: #2706. For me the example is stable as-is but the crash is triggered when you uncomment the line that does:

curl_multi_setopt(multi_handle, CURLMOPT_MAX_HOST_CONNECTIONS, 6);

It doesn't consistently crash every time but with this parameter set it happens quite frequently.

@bagder
Copy link
Member

bagder commented Jul 3, 2018

the crash is triggered when you ...

Thanks. I'll dig in asap!

@jeroen
Copy link
Contributor Author

jeroen commented Jul 3, 2018

I tweaked the example in #2706 a bit more. This program is stable using the libcurl that ships with OSX high sierra (curl 7.54.0 (x86_64-apple-darwin17.0) libcurl/7.54.0 LibreSSL/2.0.20 zlib/1.2.11 nghttp2/1.24.0).

However using the curl master branch, there are several different places where it may crash. It's hard to narrow down why it happens but it appears frequently when we have many connections to different http servers.

Thanks for looking into it!

@bagder
Copy link
Member

bagder commented Jul 3, 2018

I can reproduce, and I'm on to something...

@bagder
Copy link
Member

bagder commented Jul 4, 2018

Turned out to be a bit hairy. I figured out some problems but when tightening the bolts around those I stepped in some other goo that needs cleanup for this to work fine...

@bagder
Copy link
Member

bagder commented Jul 4, 2018

@tatsuhiro-t I'm facing an issue here that I can't really explain after having starred on it for a while, so I just wanted to bounce it with you to hear if you might have any thought on A) what might cause this or perhaps B) some thought on what to check/debug next.

Here's what I observe:

With this crawler example code, we do a lot of parallel transfers and quite a lot of them are multiplexed h2. The crawler has an aggressive time-out on 5 seconds and every once in a while this time-out triggers, which makes curl to stop that particular transfer.

When curl stops a http2 transfer due to a time-out, it will (in the Curl_http2_done() function) send a RST_STREAM and then call nghttp2_session_set_stream_user_data() for that stream id to clear the association for that stream id to the easy handle of the transfer that just ended. This, to make sure that if there's any further incoming frames for that stream arriving, we will just ignore them since the handle for it is gone.

But here's the strange part: our test program appears to get such a "trailing" packet, a period after the clearing of the association mentioned above, and nghttp2 seems to still associate it with the former handle, returning to curl a handle that is no longer in the game !!

This leads to a crash here where 'stream' is used, which is an invalid pointer since the handle is not used for a transfer any longer so data_s points somewhere bad.

I've specifically logged the stream id => easy handle association when set to nghttp2 and I get a log output for when it is cleared and yet several lines below on_stream_close gets called and nghttp2_session_get_stream_user_data() then returns the old pointer still!

Here's some sample output from a typical crash:

Program received signal SIGSEGV, Segmentation fault.
0x0000555555613490 in on_stream_close (session=0x5555964549a0, stream_id=1951, error_code=5, 
    userp=0x55559648cfb8) at http2.c:847
847	    stream->closed = TRUE;
(gdb) p data_s
$1 = (struct Curl_easy *) 0x5555b7b2e0a8
(gdb) p stream
$2 = (struct HTTP *) 0x81

Searching backwards in the debug output I have, grepping for the specific handle involved in this crash 0x5555b7b2e0a8, I see:

...
http2.c:2044 stream id 1951 == handle 0x5555b7b2e0a8 set session 0x5555964549a0
....
Curl_http2_done: stream id 1951 handle 0x5555b7b2e0a8 cleared session 0x5555964549a0
...
[7440] Connection failure [Timeout was reached]: https://www.thomsonreuters.com/en/careers/connect.html?_ga=2.244972229.1957292823.1522066299-1074365959.1513192629 [0x5555b7b2e0a8]

The last line being the output from the crawler.c application, when libcurl has return a time-out error for that particular transfer. The on_stream_close crash then happens a short while after this. (there are no other lines in this period for that handle nor that stream id). It looks like nghttp2 sends back a handle where it shouldn't, but I also quickly glanced at the the nghttp2 code handling this and it seems unlikely that it would have an error like that. Or is there some queuing or other asynch stuff that could make this happen?

I've failed to reproduce this problem with just two connections where one gets timed out. I can reproduce it fairly easy but sometimes it takes upward 20,000 transfers for it to trigger! :-O

If anyone else has any bright ideas or suggestions, throw them my way!

@ThomasObenaus
Copy link

ThomasObenaus commented Jul 5, 2018

Hey @bagder as you mentioned, #2688 seams to have the same root-cause.

Here some more information from my investigations. Sorry for the wall of text.

I fear there is something wrong with the handling of this h2 (nghttp2_session) and somehow the stream is still in its stream-map?!
Especially this item->frame.hd.stream_id returning the old stream_id looks strange to me. I guess the element is still in the session when item = nghttp2_session_pop_next_ob_item(session); is called inside nghttp2_session_mem_send_internal.

Backtrace

#0  on_stream_close (session=..., stream_id=..., error_code=..., userp=...) at http2.c:882
#1  0x0000007f84cc1420 in nghttp2_session_close_stream (session=..., stream_id=..., error_code=...) at nghttp2_session.c:1162
#2  0x0000007f84cc579c in nghttp2_session_mem_send_internal (session=..., data_ptr=..., fast_cb=...) at nghttp2-1.17.0/lib/nghttp2_session.c:2945
#3  0x0000007f84cc6044 in nghttp2_session_send (session=...) at nghttp2_session.c:3215
#4  0x0000007f84d3177c in h2_session_send (data=..., h2=...) at http2.c:1526
#5  0x0000007f84d3246c in http2_send (conn=..., sockindex=..., mem=..., len=..., err=...) at http2.c:1862
#6  0x0000007f84d01228 in Curl_write (conn=..., sockfd=..., mem=..., len=..., written=...) at sendf.c:337
#7  0x0000007f84d127d4 in readwrite_upload (didwhat=..., conn=..., data=...) at transfer.c:1018
#8  Curl_readwrite (conn=..., data=..., done=..., comeback=...) at transfer.c:1134
#9  0x0000007f84d1c1e4 in multi_runsingle (multi=..., now=..., data=...) at multi.c:1914
#10 0x0000007f84d1cc2c in curl_multi_perform (multi=..., running_handles=...) at multi.c:2181
...

frame-5 0x0000007f84d3246c in http2_send (conn=..., sockindex=..., mem=..., len=..., err=...) at http2.c:1862

  • conn->data = 0x7f0c3dd148 (easy_handle)
  • conn->data->req.protop = 0x7f0d022af8 (stream HTTP)
  • conn->proto.httpc->h2 = 0x7f0cb08910 (session)
  • ((nghttp2_stream*)conn->proto.httpc->h2.streams.table[0x7f]).stream_user_data = 0x7f0c69c618 (easy_handle) <============ is invalid, was already freed

frame-4 0x0000007f84d3177c in h2_session_send (data=..., h2=...) at http2.c:1526

  • easy_handle = 0x7f0c3dd148
  • stream = 0x7f0d022af8
  • stream_id = 0x1db (475)

frame-3 0x0000007f84cc6044 in nghttp2_session_send (session=...) at nghttp2_session.c:3215

  • session = 0x7f0cb08910
  • session->user_data = 0x7f0cade0c8
  • session->next_stream_id = 0x233 (563)
  • session->last_sent_stream_id = 0xb1 (177)
  • session->last_recv_stream_id = 0x0
  • session->last_proc_stream_id = 0x0

frame-2 0x0000007f84cc579c in nghttp2_session_mem_send_internal (session=..., data_ptr=..., fast_cb=...) at nghttp2_session.c:2945

  • data_ptr = 0x7f4abf8d78
  • session = 0x7f0cb08910
  • session->next_stream_id = 0x233 (563)
  • session->last_sent_stream_id = 0xb1 (177)
  • session->last_recv_stream_id = 0x0
  • session->last_proc_stream_id = 0x0
  • item->frame.hd.type = NGHTTP2_HEADERS
  • item->frame.headers.cat == NGHTTP2_HCAT_REQUEST
  • item->frame.hd.stream_id = 0x16b (363) <=============== why this stream ??
  • item->aux_data.headers.error_code = 5

frame-1 0x0000007f84cc1420 in nghttp2_session_close_stream (session=..., stream_id=..., error_code=...) at nghttp2_session.c:1162

  • session = 0x7f0cb08910
  • session->next_stream_id = 0x233 (563)
  • session->last_sent_stream_id = 0xb1 (177)
  • session->last_recv_stream_id = 0x0
  • session->last_proc_stream_id = 0x0
  • stream_id = 0x16b (363) <=============== ??
  • session.streams.table[0x7f].next = 0x0 <=============== ??
  • session.streams.table[0x7f].key = 0x16b <=============== ??
  • stream = 0x7f0cb06f10
  • stream->state = NGHTTP2_STREAM_INITIAL
  • stream->stream_user_data = 0x7f0c69c618 (easy_handle) <============ is invalid, was already freed

frame-0 on_stream_close (session=..., stream_id=..., error_code=..., userp=...) at http2.c:882

  • data_s = 0x7f0c69c618 (easy_handle) <============ is invalid, was already freed
  • stream_id = 0x16b (363)
  • session = 0x7f0cb08910
  • session->next_stream_id = 0x233 (563)
  • session->last_sent_stream_id = 0xb1 (177)
  • session->last_recv_stream_id = 0x0
  • session->last_proc_stream_id = 0x0
  • stream = 0x279e730 <============ Invalid, CRASH, when accessing!!!!

easy_handle times out gets freed and on processing a second one the already vanished stream is closed

  • easy_handle A 0x7f0c69c618 uses stream 0x16b (363) and connection 9164 (0x7f0cade0c8)
  • easy_hanlde A reaches state PERFORM
  • easy_handle B 0x7f0c3dd148 uses stream 0x1db (475) and same connection 9164 (0x7f0cade0c8)
  • According to frame-5 of the call-stack http2_send on easy_handle B is called
  • easy_hanlde B reaches state PERFORM
  • In the meantime easy_handle A's stream is killed and thus the easy_handle is freed
  • Then in frame-2 item->frame.hd.stream_id = 0x16b (363) there seams this stream to be still there
  • And curl tries to close the stream --> CRASH!!
Log
07:40:56.222     [Warn] [user] nav: HTTPR: performRequest->curl_easy_init (curl_easy=0x7f0c69c618,req=0x7f04026f00,tid=3329)...done
07:40:56.222     [Warn] [user] nav: HTTPR: performRequest->curl_multi_add_handle (curl_easy=0x7f0c69c618,req=0x7f04026f00,tid=3329,cCmultiAdd=0)...done
07:40:56.276     [Warn] [user] nav: HTTPR: performRequest->curl_easy_init (curl_easy=0x7f0c3dd148,req=0x7f0404fa80,tid=3329)...done
07:40:56.276     [Warn] [user] nav: HTTPR: performRequest->curl_multi_add_handle (curl_easy=0x7f0c3dd148,req=0x7f0404fa80,tid=3329,cCmultiAdd=0)...done
07:40:56.333     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] STATE: INIT => CONNECT handle 0x7f0c69c618; line 1429 (connection #-5000)
07:40:56.333     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] Conn: 9164 (0x7f0cade0c8) Receive pipe weight: (-1/0), penalized: FALSE
07:40:56.333     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] Multiplexed connection found!
07:40:56.333     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] Found connection 9164, with requests in the pipe (1)
07:40:56.334     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] STATE: CONNECT => DO handle 0x7f0c69c618; line 1474 (connection #9164)
07:40:56.334     [Warn] [user] nav: CURL.H2 [easy_handle=0x7f0c69c618,magic=0xc0dedbad,tid=3329] http2_send len=1076
07:40:56.334     [Warn] [user] nav: CURL.NGHTTP2 nghttp2_submit_request session=0x7f0cb08910,stream_user_data=0x7f0c69c618,tid=3329
07:40:56.334     [Warn] [user] nav: CURL.NGHTTP2 submit_headers_shared_nva session=0x7f0cb08910,stream_user_data=0x7f0c69c618,tid=3329
07:40:56.334     [Warn] [user] nav: CURL.NGHTTP2 submit_headers_shared session=0x7f0cb08910,stream_user_data=0x7f0c69c618,tid=3329
**07:40:56.334     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] Using Stream ID: 16b (easy handle 0x7f0c69c618)**
07:40:56.334     [Warn] [user] nav: [0x7f0c69c618] cURL [HEADER_OUT] POST /api/destin/v1/slde HTTP/2
07:40:56.335     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] STATE: DO => DO_DONE handle 0x7f0c69c618; line 1695 (connection #9164)
07:40:56.335     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] STATE: DO_DONE => WAITPERFORM handle 0x7f0c69c618; line 1822 (connection #9164)
07:40:56.335     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] STATE: WAITPERFORM => PERFORM handle 0x7f0c69c618; line 1837 (connection #9164)

07:40:56.377     [Warn] [user] nav: [0x7f0c3dd148] cURL [TEXT] STATE: INIT => CONNECT handle 0x7f0c3dd148; line 1429 (connection #-5000)
07:40:56.377     [Warn] [user] nav: [0x7f0c3dd148] cURL [TEXT] Conn: 9164 (0x7f0cade0c8) Receive pipe weight: (-1/0), penalized: FALSE
07:40:56.377     [Warn] [user] nav: [0x7f0c3dd148] cURL [TEXT] Multiplexed connection found!
07:40:56.377     [Warn] [user] nav: [0x7f0c3dd148] cURL [TEXT] Found connection 9164, with requests in the pipe (57)
07:40:56.377     [Warn] [user] nav: [0x7f0c3dd148] cURL [TEXT] STATE: CONNECT => DO handle 0x7f0c3dd148; line 1474 (connection #9164)
07:40:56.377     [Warn] [user] nav: CURL.H2 [easy_handle=0x7f0c3dd148,magic=0xc0dedbad,tid=3329] http2_send len=1076
07:40:56.378     [Warn] [user] nav: CURL.NGHTTP2 nghttp2_submit_request session=0x7f0cb08910,stream_user_data=0x7f0c3dd148,tid=3329
07:40:56.378     [Warn] [user] nav: CURL.NGHTTP2 submit_headers_shared_nva session=0x7f0cb08910,stream_user_data=0x7f0c3dd148,tid=3329
07:40:56.378     [Warn] [user] nav: CURL.NGHTTP2 submit_headers_shared session=0x7f0cb08910,stream_user_data=0x7f0c3dd148,tid=3329
**07:40:56.378     [Warn] [user] nav: [0x7f0c3dd148] cURL [TEXT] Using Stream ID: 1db (easy handle 0x7f0c3dd148)**
07:40:56.378     [Warn] [user] nav: [0x7f0c3dd148] cURL [HEADER_OUT] POST /api/destin/v1/slde HTTP/2
07:40:56.378     [Warn] [user] nav: [0x7f0c3dd148] cURL [TEXT] STATE: DO => DO_DONE handle 0x7f0c3dd148; line 1695 (connection #9164)
07:40:56.378     [Warn] [user] nav: [0x7f0c3dd148] cURL [TEXT] STATE: DO_DONE => WAITPERFORM handle 0x7f0c3dd148; line 1822 (connection #9164)
07:40:56.378     [Warn] [user] nav: [0x7f0c3dd148] cURL [TEXT] STATE: WAITPERFORM => PERFORM handle 0x7f0c3dd148; line 1837 (connection #9164)

07:40:57.566     [Warn] [user] nav: CURL.H2 [easy_handle=0x7f0c69c618,magic=0xc0dedbad,tid=3329] http2_recv: easy 0x7f0c69c618 (stream 363)
07:40:57.566     [Warn] [user] nav: CURL.H2 [easy_handle=0x7f0c69c618,magic=0xc0dedbad,tid=3329] nread=9
07:40:57.566     [Warn] [user] nav: CURL.H2 [easy_handle=0x7f0c69c618,magic=0xc0dedbad,tid=3329] nghttp2_session_mem_recv() returns 9
07:40:57.566     [Warn] [user] nav: CURL.H2 [easy_handle=0x7f0c69c618,magic=0xc0dedbad,tid=3329] All data in connection buffer processed
07:40:57.566     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] Kill stream: Transfer returned error
07:40:57.566     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] multi_done
**07:40:57.566     [Warn] [user] nav: CURL.H2 [easy_handle=0x7f0c69c618,magic=0xc0dedbad,tid=3329] nghttp2_session_set_stream_user_data called in Curl_http2_done http->stream_id=363 (0x16b)**
07:40:57.569     [Info] [n.t.n.cloud.comm.client.HTTPRequestHandler] [0x7f0c69c618] Response received from curl.
07:40:57.569    [Error] [n.t.n.cloud.comm.client.HTTPRequestHandler] [0x7f0c69c618]CurlError: Failed sending data to the peer (37)
07:40:57.570    [Error] [n.t.n.cloud.comm.client.HTTPRequestHandler] [0x7f0c69c618]Error info: Operation timed out after 1361 milliseconds with 0 bytes received
07:40:57.570     [Info] [n.t.n.cloud.comm.client.HTTPRequestHandler] [0x7f0c69c618] httpRequestDone
07:40:57.582     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] Connection still in use 0/98, no more multi_done now!
07:40:57.582     [Warn] [user] nav: HTTPR: main->curl_multi_info_read (curl_multi=0x31b9b968,numMessages=0,tid=3329,msg=0x7f0c69c680easy_handle=0x7f0c69c618msg.msg=1)...done
07:40:57.582     [Warn] [user] nav: HTTPR: httpRequestDone->curl_multi_remove_handle (curl_easy=0x7f0c69c618,req=0x7f04026f00,tid=3329)...
07:40:57.582     [Warn] [user] nav: [0x7f0c69c618] cURL [TEXT] Expire cleared
07:40:57.582     [Warn] [user] nav: HTTPR: httpRequestDone->curl_multi_remove_handle (curl_easy=0x7f0c69c618,req=0x7f04026f00,cCode=0,tid=3329)...done
07:40:57.582     [Warn] [user] nav: HTTPR: httpRequestDone->curl_easy_cleanup (curl_easy=0x7f0c69c618,req=0x7f04026f00,tid=3329)...
07:41:05.309    [Debug] [user] Watchdog-Event: /var/_sl_nav/default/usr/bin/xyz, run fault signal 11 PID 2764

Same easy_handle after init

The easy_handle where the http2_send starts with 0x7f0c3dd148 gets somehow reused. This means it was used, then cleaned up and in with the next curl_easy_init an easy_handle at the exact same position is created.

Log
HTTPR: performRequest->curl_easy_init (curl_easy=0x7f0c3dd148,req=0x7f0400b690,tid=3329)...done
[0x7f0c3dd148] Response received from curl.
[0x7f0c3dd148]CurlError: Couldn't connect to server (7)
HTTPR: httpRequestDone->curl_easy_cleanup (curl_easy=0x7f0c3dd148,req=0x7f0400b690,tid=3329)...done
...
HTTPR: performRequest->curl_easy_init (curl_easy=0x7f0c3dd148,req=0x7f0404fa80,tid=3329)...done
HTTPR: performRequest->curl_multi_add_handle (curl_easy=0x7f0c3dd148,req=0x7f0404fa80,tid=3329,cCmultiAdd=0)...done
...
Conclusion

Should work since calloc is used to allocated the memory (inits with 0).

Is the easy_handle removed from the stream?

The easy_handle 0x7f0c69c618 which is then taken and leads to the crash was freed correctly by calling nghttp2_session_set_stream_user_data about 8 seconds before the crash.

Log
07:40:56.222 HTTPR: performRequest->curl_easy_init (curl_easy=0x7f0c69c618,req=0x7f04026f00,tid=3329)...done
07:40:57.566 CURL.H2 [easy_handle=0x7f0c69c618,magic=0xc0dedbad,tid=3329] nghttp2_session_set_stream_user_data called in Curl_http2_done http->stream_id=363
07:40:57.569 [0x7f0c69c618] Response received from curl.
07:40:57.569 [0x7f0c69c618]CurlError: Failed sending data to the peer (37)
07:40:57.570 [0x7f0c69c618]Error info: Operation timed out after 1361 milliseconds with 0 bytes received
07:41:05.309    [Debug] [user] Watchdog-Event: /var/_sl_nav/default/usr/bin/xyz, run fault signal 11 PID 2764
Conclusion

Should work since nghttp2_session_set_stream_user_data is called.

@tatsuhiro-t
Copy link
Contributor

tatsuhiro-t commented Jul 5, 2018

I can reproduce the crash, and ASAN showed the same backtrace as #2674 (comment)

I could not see on_stream_close is called and it gets non NULL stream user data after Curl_http2_done is called. What I saw is opposite, on_stream_close is called and then Curl_http2_done is called. I'm not sure it can cause a problem.

conn->bundle was set to NULL well before the crash was observed. It was set to NULL by Curl_conncache_extract_oldest.

Does curl use native thread?

@gvanem
Copy link
Contributor

gvanem commented Jul 5, 2018

@jeroen
I tried your crawler.c example. And it crashes inside libcurl!Curl_http_readwrite_headers on Windows-10 too. At http.c / line 3569:

          /* HTTP/2 cannot blacklist multiplexing since it is a core
             functionality of the protocol */
          conn->bundle->multiuse = BUNDLE_MULTIPLEX;  << !! crash

Above conn->bundle is NULL. A simple if (conn->bundle) first, fixes that crash, but then it crashes on some XML function.

@bagder
Copy link
Member

bagder commented Jul 5, 2018

@gvanem we're using that example to repro this issue.

@tatsuhiro-t yeah, I've actually fixed a few other problems in my local branch to end up where I am, the conn->bundle is one of them.

My current dev branch is now available at https://github.com/curl/curl/tree/bagder/2674-http2-segfault note that it includes a few crude fprintf()s and and asserts that will be removed before this gets merged anywhere.

@bagder
Copy link
Member

bagder commented Jul 12, 2018

Status update

I merged a few fixes (#2733) I've found so far, but that still just changed the nature of the problems. The crawler code still crashes reproducibly. The hard part is that it takes a whole lot of transfers before it crashes so there's a lot going on at the moment which makes it difficult to trace the origin of the problem and not just work on the results of a bug done "a while ago".

@bagder
Copy link
Member

bagder commented Jul 14, 2018

I think I've found the most problematic issues. It's (again) an issue where things are wrongly modified in the easy handle when a connection/stream is handled, so at the end of a stream it resets things in the handle that causes havoc for other streams on the same connection.

I'm now testing my fixes an extra round and making sure that all current tests also keep running. I hope to push another PR with this work early next week.

@jeroen
Copy link
Contributor Author

jeroen commented Jul 14, 2018

That's great news! Thanks for getting to the bottom of this.

@ThomasObenaus
Copy link

@bagder yes that's really cool to hear.
Do you eventually have these fixes on a branch I can use for testing as well? Maybe the #2688 is then solved too.

@bagder
Copy link
Member

bagder commented Jul 16, 2018

They will be merged into master. I'll let you know when it's time! (I've been out and about a bit the past few days)

@ReneReusner
Copy link

ReneReusner commented Jul 16, 2018

Hello

I'm from the same company as ThomasObenaus, and helped debugging the crash too.

The crash resulted in accessing a former valid, but now invalid (deleted) Curl_easy handle. After a timeout thomas did a curl_multi_remove_handle and a curl_easy_cleanup.
Sometimes the curl_multi_remove_handle did not reset the stream_userdata to zero for this stream id.

When curl wants to close the session, or just find a frame with this now invalid stream id in the session, albeit handling a another Curl_easy handle, it called a callback with a invalid Curl_easy handle -> BOOM,

My fast workaround is, track the session and stream_id in Curl_easy and clear the stream_userdata it when removing the handle in curl_multi_remove_handle. Its just a bell wire around the engine block, but it seems to work for a now. But i prefer a proper bugfix too :-).

@bagder
Copy link
Member

bagder commented Jul 16, 2018

@ReneReusner I already landed a few fixes and there are more coming. Once I've landed these fixes I will be very interested in further debugging and testing to make sure we iron out all the last wrinkles.

@bagder
Copy link
Member

bagder commented Jul 20, 2018

I just landed #2751 as commit 7b9bc96 and now the crashes are if not gone, at least much less likely to occur.

@jeroen
Copy link
Contributor Author

jeroen commented Jul 22, 2018

Thanks, looking good. Are you still planning on a patch release soon?

@ThomasObenaus
Copy link

Sadly it does not solve #2688. I got the same bt even with the fixes of 7b9bc96.

@jeroen
Copy link
Contributor Author

jeroen commented Jul 25, 2018

@ThomasObenaus Did you try 81377ca ?

@ThomasObenaus
Copy link

@jeroen I tried with current master including 81377ca ,sadly no success :-(

@bagder
Copy link
Member

bagder commented Jul 25, 2018

Let me remind you that I can't reproduce your problem. With @jeroen's example I managed to find and fix a whole slew of issues, but as they've gradually been fixed that sample program now works really well. I'm not entirely convinced it runs 100% fine all the time yet, but without a way to reproduce your case @ThomasObenaus, we risk having to live with that until someone can reproduce it.

@ThomasObenaus
Copy link

@bagder I totally understand this. But yet I found no other way to reproduce than the one I described (which is not transferable to you).

Maybe one last question:
In the logs I see that, when cleaning up the easy_handle, it fails on removing itself from the stream (the call to nghttp2_session_set_stream_user_data fails). And then later when another stream is closed which still contains a reference to that easy_handle it crashes.
I have the feeling that there is sometimes a situation where the easy_handle and the h2-session does not match or get mixed.
Could that be the case? Or can you maybe point me at a location which I should look at for this potential problem? I'm a bit lost in the curl-code since its new for me.

@bagder
Copy link
Member

bagder commented Jul 28, 2018

Okay while #2688 is still happening, @jeroen has confirmed his problems are not showing up anymore so I consider this particular issue fixed.

@bagder bagder closed this as completed Jul 28, 2018
@ReneReusner
Copy link

ReneReusner commented Aug 11, 2018

Hello.

I managed to get some hours at work to look in this problem again. What i had found: On a curl time out in multi.c (~1400)

      timeout_ms = Curl_timeleft(data, &now,
                                 (data->mstate <= CURLM_STATE_WAITDO)?
                                 TRUE:FALSE);

      if(timeout_ms < 0) {
        /* Handle timed out */
............

        /* Force connection closed if the connection has indeed been used */
        if(data->mstate > CURLM_STATE_DO) {
          streamclose(data->easy_conn, "Disconnected with pending data");
          stream_error = TRUE;
        }
        result = CURLE_OPERATION_TIMEDOUT;
        (void)multi_done(&data->easy_conn, result, TRUE);
        /* Skip the statemachine and go directly to error handling section. */
        goto statemachine_end;

The multi_done here

static CURLcode multi_done(struct connectdata **connp,
                          CURLcode status,  /* an error if this is called
                                               after an error was detected */
                          bool premature)

needs to clear the user data in the associated stream. If it don't to it, then sometimes when nghttp2 iterate over all stream in a session, sometimes it will find a "stale" user_data and may call a callback with a deleted Curl_easy -> BOOM.

I added it in

void Curl_http2_done(struct connectdata *conn, bool premature)
{

Please also take a look in multi.c ~ 1350

static CURLMcode multi_runsingle(struct Curl_multi *multi,
                                 struct timeval now,
                                 struct Curl_easy *data)
...
      data->state.pipe_broke = FALSE;
      data->easy_conn = NULL;
      /* My comment for a debug assertion
         clear http2 session infos, the http2 session is already gone,
         but the assert in http2.c:1979
         assert(!(conn->data->tmp1) || (conn->data->tmp1 == h2)); & co
         can still bite us, if we get a new connection to continue */

On a broken pipe, Cause i'm not familiar with curl, i don't know if the associated connection can be still used? If yes (maybe its placed in the connection cache), we need to remove the stream_userdata here too.

One another thing. Its not a error but can break on 64bit machines a backtrace in gdb:
sendf.c ~ line 865:

  if(data->set.fdebug) {
    return curlDebugCallback(data, type, ptr, size,  data->set.debugdata);
    /*return (*data->set.fdebug)(data, type, ptr, size,

For example on a stale Curl_easy, set.fdebug points to a invalid address, when the cpu jump to it, it crashes yes. BUT (cause of some brain damaged ABI designers, redzone, framepointer optional) the gdb needs a "valid" PC to get the unwind information. If its not in a known region -> no backtrace. Then you need a walk_stack macro. Callbacks for debugging on 64bit machines are not always a good idea.
A callback which point to a invalid address (for example from a old stale Curl_easy) can result in a broken backtrace.

@lock lock bot locked as resolved and limited conversation to collaborators Nov 9, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Development

No branches or pull requests

7 participants