Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

curl infinite loop calling CURLMOPT_TIMERFUNCTION callback #3537

Closed
rgoomes opened this issue Feb 8, 2019 · 16 comments
Closed

curl infinite loop calling CURLMOPT_TIMERFUNCTION callback #3537

rgoomes opened this issue Feb 8, 2019 · 16 comments
Labels

Comments

@rgoomes
Copy link

rgoomes commented Feb 8, 2019

I did this

Recently, we got into a situation where curl is calling our CURLMOPT_TIMERFUNCTION callback in loop and then in crashes possible due to stack buffer overflow. The stack trace looks something like this:

Thread 12 Crashed:
0   libsystem_malloc.dylib        	0x00000001fc61d974 nanov2_allocate_from_block$VARIANT$mp + 4
1   libsystem_malloc.dylib        	0x00000001fc61ce1c nanov2_allocate$VARIANT$mp + 124
2   libsystem_malloc.dylib        	0x00000001fc61cd50 nanov2_malloc$VARIANT$mp + 60
3   libsystem_malloc.dylib        	0x00000001fc62bc08 malloc_zone_malloc + 156
4   libsystem_malloc.dylib        	0x00000001fc62c618 malloc + 28
5   joyn v2                       	0x0000000102a249a0 alloc_addbyter.llvm.3849878086575379832 + 132
6   joyn v2                       	0x0000000102a245ec dprintf_formatf.llvm.3849878086575379832 + 3920
7   joyn v2                       	0x0000000102a248b0 curl_maprintf + 60
8   joyn v2                       	0x0000000102a1590c create_hostcache_id.llvm.11373449002202608968 + 32
9   joyn v2                       	0x0000000102a15564 fetch_addr.llvm.11373449002202608968 + 40
10  joyn v2                       	0x0000000102a154ec Curl_fetch_addr + 92
11  joyn v2                       	0x0000000102a2bc68 multi_runsingle + 2012
12  joyn v2                       	0x0000000102a2d17c curl_multi_socket_action + 392
13  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
14  joyn v2                       	0x0000000102a2a608 update_timer + 144
15  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
16  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
17  joyn v2                       	0x0000000102a2a608 update_timer + 144
18  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
19  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
20  joyn v2                       	0x0000000102a2a608 update_timer + 144
21  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
22  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
23  joyn v2                       	0x0000000102a2a608 update_timer + 144
24  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
25  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
26  joyn v2                       	0x0000000102a2a608 update_timer + 144
27  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
28  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
29  joyn v2                       	0x0000000102a2a608 update_timer + 144
30  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
31  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
32  joyn v2                       	0x0000000102a2a608 update_timer + 144
33  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
34  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
35  joyn v2                       	0x0000000102a2a608 update_timer + 144
36  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
37  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
38  joyn v2                       	0x0000000102a2a608 update_timer + 144
39  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
40  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
41  joyn v2                       	0x0000000102a2a608 update_timer + 144
42  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
43  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
44  joyn v2                       	0x0000000102a2a608 update_timer + 144
45  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
46  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
47  joyn v2                       	0x0000000102a2a608 update_timer + 144
48  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
49  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
50  joyn v2                       	0x0000000102a2a608 update_timer + 144
51  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
52  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
53  joyn v2                       	0x0000000102a2a608 update_timer + 144
54  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
55  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
56  joyn v2                       	0x0000000102a2a608 update_timer + 144
57  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
58  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
59  joyn v2                       	0x0000000102a2a608 update_timer + 144
60  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
61  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
62  joyn v2                       	0x0000000102a2a608 update_timer + 144
63  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
64  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
65  joyn v2                       	0x0000000102a2a608 update_timer + 144
66  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
67  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
68  joyn v2                       	0x0000000102a2a608 update_timer + 144
69  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
70  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
71  joyn v2                       	0x0000000102a2a608 update_timer + 144
72  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
73  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
74  joyn v2                       	0x0000000102a2a608 update_timer + 144
75  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
76  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
77  joyn v2                       	0x0000000102a2a608 update_timer + 144
78  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
79  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
80  joyn v2                       	0x0000000102a2a608 update_timer + 144
81  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
82  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
83  joyn v2                       	0x0000000102a2a608 update_timer + 144
84  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
85  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
86  joyn v2                       	0x0000000102a2a608 update_timer + 144
87  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
88  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
89  joyn v2                       	0x0000000102a2a608 update_timer + 144
90  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
91  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
92  joyn v2                       	0x0000000102a2a608 update_timer + 144
93  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
94  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
95  joyn v2                       	0x0000000102a2a608 update_timer + 144
96  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
97  joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120
98  joyn v2                       	0x0000000102a2a608 update_timer + 144
99  joyn v2                       	0x0000000102a2d1ec curl_multi_socket_action + 504
100 joyn v2                       	0x0000000102510dd0 NetworkManager::curlMultiTimerCB+ 30821840 (void*, long, NetworkManager*) + 120

and it goes forever until it crashes. Our timer callback is very similar to multi_timer_cb callback of asiohiper. The callback is being called in loop with timeout_ms = 0. We are unable to replicate the issue, but it already happened 3 times in a week and it is always in iOS.

curl/libcurl version

7.63.0 release

operating system

iOS only, iPhone 7, iOS 12, 64-bit

additional information

it is not reproducible with libcurl 7.58.0

@bagder bagder added the crash label Feb 8, 2019
@bagder
Copy link
Member

bagder commented Feb 8, 2019

Without a way to reproduce or figure out the internal state at the time of the crash this is really hard for us to handle. libcurl has a system to avoid calling the callback again for the same timeout. This rather looks like the time tree is damaged or something.

Have you made sure that you're not accessing any libcurl handles simultaneously from more than one thread?

@rgoomes
Copy link
Author

rgoomes commented Feb 8, 2019

Without a way to reproduce or figure out the internal state at the time of the crash this is really hard for us to handle.

Yes, I am aware of that. Unfortunately, this appears to happen randomly

libcurl has a system to avoid calling the callback again for the same timeout. This rather looks like the time tree is damaged or something.

I had the opportunity to look into curl code. From my understanding I can't see how that mechanism will stop the callback from being called multiple times due to this piece of code:

else {
    /* Asked to run due to time-out. Clear the 'lastcall' variable to force
       update_timer() to trigger a callback to the app again even if the same
       timeout is still the one to run after this call. That handles the case
       when the application asks libcurl to run the timeout prematurely. */
    memset(&multi->timer_lastcall, 0, sizeof(multi->timer_lastcall));
}

maybe it is handled somewhere else?

Have you made sure that you're not accessing any libcurl handles simultaneously from more than one thread?

Yes, libcurl is running in 1 thread only and on top of that it is stranded. This started to occur when we upgraded libcurl from version 7.58.0 to 7.63.0, not sure if it is related.

@bagder
Copy link
Member

bagder commented Feb 11, 2019

I had the opportunity to look into curl code. From my understanding I can't see how that mechanism will stop the callback from being called multiple times due to this piece of code:

Sure, but that logic is there to handle when libcurl is called to handle a timeout before the timeout has actually expired internally. Since before that time, the necessary actions won't be performed and thus libcurl wants to get told again when the timeout actually has expired. That shouldn't be possible to occur in a loop, unless I'm missing something.

Additionally: even when this happens in a rapid loop (which of course it shouldn't), that still shouldn't be a reason for a crash. You seem to call libcurl back again in the callback that curl tells you about the new timeout in. What if you don't do that and instead call it using the regular timeout handling so that you don't get this recursive effect? Or as a test, consider "now" (the zero value) to be 1 millisecond and see if that changes anything. libcurl counts expiry based on milliseconds internally, could that loop happen within the same millisecond? Do we get at least millisecond accuracy on the timer in iOS?

This started to occur when we upgraded libcurl from version 7.58.0 to 7.63.0

Lots of things have changed in curl in that time period, but we haven't changed any behavior in this area - at least not on purpose.

@rgoomes
Copy link
Author

rgoomes commented Feb 11, 2019

Additionally: even when this happens in a rapid loop (which of course it shouldn't), that still shouldn't be a reason for a crash. You seem to call libcurl back again in the callback that curl tells you about the new timeout in. What if you don't do that and instead call it using the regular timeout handling so that you don't get this recursive effect?

This is a copy of the CURLMOPT_TIMERFUNCTION callback:

//CURLMOPT_TIMERFUNCTION
static int curlMultiTimerCB(CURLM *multi, long timeoutMS, NetworkManager* nmanager)
{
	nmanager->m_asioTimer.cancel();

	if (timeoutMS > 0)
	{
		nmanager->m_asioTimer.expires_from_now(boost::posix_time::millisec(timeoutMS));
		nmanager->m_asioTimer.async_wait([nmanager](const boost::system::error_code& error)
		{
			if (error)
				return;

			int stillRunningHandles;
			auto ccode = curl_multi_socket_action(nmanager->m_curlMulti, CURL_SOCKET_TIMEOUT, 0, &stillRunningHandles);

			std::lock_guard<std::mutex> lock(nmanager->m_mutex);
			nmanager->checkCURLHandles();
		});

		return 0;
	}

	if (timeoutMS == 0) //success
	{
		int stillRunningHandles;
		auto ccode = curl_multi_socket_action(nmanager->m_curlMulti, CURL_SOCKET_TIMEOUT, 0, &stillRunningHandles);

		nmanager->checkCURLHandles();
	}

	return 0;
}

can you identify something wrong in the code? anything that can be improved? checkCURLHandles
checks for completed transfers to cleanup curl handles, that is, it calls in loop curl_multi_info_read in search for msg == CURLMSG_DONE

Or as a test, consider "now" (the zero value) to be 1 millisecond and see if that changes anything.

Note that curl documentation tells that with a timeout_ms value of 0 we should call curl_multi_socket_action and that's exactly what we are doing as shown in the above code. In such cases curl 7.63.0 will fallback to that else which memsets the multi->timer_lastcall. Also, in local tests seems that it is expected for curl to call the callback multiple times (~10 consecutive times) with timeout_ms = 0 but I guess this is curl detecting other successful requests.

libcurl counts expiry based on milliseconds internally, could that loop happen within the same millisecond? Do we get at least millisecond accuracy on the timer in iOS?

I think it is up to curl to ensure that millisecond accuracy is guaranteed in iOS. This is the thing that bothers me the most, this loop is not happening in Android nor Windows for some reason.

Reference: https://curl.haxx.se/libcurl/c/CURLMOPT_TIMERFUNCTION.html

@TvdW
Copy link

TvdW commented Feb 12, 2019

(Was digging into a different thing but I noticed this issue and know the answer so I may as well answer)

Note that curl documentation tells that with a timeout_ms value of 0 we should call curl_multi_socket_action and that's exactly what we are doing as shown in the above code.

This may be a misinterpretation and/or miswording of the documentation. You're supposed to call the function from the next iteration of your event loop, instead of calling it immediately. The same thing would happen if you actually did set a 0 ms timer, and the documentation could be worded as "0 is not a special value: a timer should be set whenever timeout_ms is not -1".

@bagder
Copy link
Member

bagder commented Feb 12, 2019

This may be a misinterpretation and/or miswording of the documentation

Maybe, but calling back like that is done by example already and should probably work. I asked @rgoomes to try the other approach just to see if that changes anything or not.

@rgoomes wrote:

I think it is up to curl to ensure that millisecond accuracy is guaranteed in iOS.

I don't know iOS and I don't know what time accuracy it provides. I was hoping someone who works on that platform could just confirm it for us. libcurl is highly portable and runs on way more platforms than any of us in the core team owns or masters ourselves...

@rgoomes
Copy link
Author

rgoomes commented Feb 15, 2019

This may be a misinterpretation and/or miswording of the documentation

Maybe, but calling back like that is done by example already and should probably work. I asked @rgoomes to try the other approach just to see if that changes anything or not.

Yes, assuming a timeout of 0 as 1 millisecond apparently fixes the problem. However, I don't like this workaround of delaying the process by 1 millisecond.

@jay
Copy link
Member

jay commented Feb 17, 2019

Have you made sure that you're not accessing any libcurl handles simultaneously from more than one thread?

Yes, libcurl is running in 1 thread only and on top of that it is stranded. This started to occur when we upgraded libcurl from version 7.58.0 to 7.63.0, not sure if it is related.

Just to be clear here you're saying basically you have asynchronous callback but they are set to run sequentially in the same thread (ie stranded)? Like what thread is the async call to curl_multi_socket_action and who's to say that's not also running in another thread and where is the synchronization? We don't guarantee accessing the same handle from more than one thread at a time, libcurl doesn't have internal thread synchronization you must provide that on your own. Review thread safety. This is the only report I've heard of this and so I think it may be your code. If this is entirely reproducible you can try bisecting it.

@bagder
Copy link
Member

bagder commented Feb 17, 2019

Yes, assuming a timeout of 0 as 1 millisecond apparently fixes the problem. However, I don't like this workaround of delaying the process by 1 millisecond.

Next attempt I suggest would be to follow @TvdW's advice and use the 0 as a normal timeout as well, so that it'll just be scheduled to run immediately next, but not called back directly from within the callback. That will avoid the recursive risk and at the same time avoid a 1 millisecond delay.

@rgoomes
Copy link
Author

rgoomes commented Feb 21, 2019

Yes, assuming a timeout of 0 as 1 millisecond apparently fixes the problem. However, I don't like this workaround of delaying the process by 1 millisecond.

Next attempt I suggest would be to follow @TvdW's advice and use the 0 as a normal timeout as well, so that it'll just be scheduled to run immediately next, but not called back directly from within the callback. That will avoid the recursive risk and at the same time avoid a 1 millisecond delay.

That approach also seems to work. Apparently, any approach that avoids the recursive risk is enough to fix the loop. However, everyone seems to be doing exactly the opposite, at least implementations using boost asio with curl, that is, they call immediately curl_multi_socket_action when the timeout is 0. I'm still not sure if this is wrong or otherwise there's a bug in libcurl.

@bagder
Copy link
Member

bagder commented Feb 22, 2019

everyone seems to be doing exactly the opposite, at least implementations using boost asio with curl, that is, they call immediately curl_multi_socket_action when the timeout is 0. I'm still not sure if this is wrong or otherwise there's a bug in libcurl.

There are several possible explanations to why you report this and not others:

  1. it could just be that nobody else has reported it, even if the bug happened for them
  2. it might be that the iOS timer isn't reporting full millisecond resolution and thus we're more likely to get called for the same time over and over, which means recursively if called from within the callback
  3. there might be a smaller stack on iOS than other platforms using the event-based API, thus making it more "vulnerable" for deep recursive calls
  4. there might be a libcurl bug or two that makes it get stuck on the same millisecond, that we haven't detected

We would need more debugging to figure out the truth.

Personally I'm not a fan of calling libcurl back from within its own callbacks. I think I'll update the examples doing this to stop them from encouraging that.

bagder added a commit that referenced this issue Feb 22, 2019
From within the timer callbacks. Recursive is problematic for several
reasons.

Discussed in #3537
@jay
Copy link
Member

jay commented Feb 22, 2019

I'm still not sure if this is wrong or otherwise there's a bug in libcurl.

To reiterate "you must never use a single handle from more than one thread at any given time." If your async call is in a different thread then that could cause this issue. AFAIK forcing consecutive execution of the calls (what I assume you mean by stranded) doesn't guarantee they're all in the same thread, that will depend on your code. If you were going to debug this further I'd record the thread id/. Like why do you need that guard in the async function but not otherwise

bagder added a commit that referenced this issue Feb 23, 2019
From within the timer callbacks. Recursive is problematic for several
reasons. They should still work, but this way the examples and the
documentation becomes simpler. I don't think we need to encourage
recursive calls.

Discussed in #3537
Closes #3601
@bagder
Copy link
Member

bagder commented Apr 14, 2019

I will suggest that we close this issue because we have not managed to reproduce it and thus nobody is working on a fix, there's a working work-around provided that makes it not trigger anymore and the examples are now modified to not encourage the pattern that triggered this issue in the first place.

@rgoomes
Copy link
Author

rgoomes commented Apr 14, 2019

Yes, you can close it. No more issues were found with the suggested advice.

@bagder bagder closed this as completed Apr 15, 2019
Totktonada added a commit to tarantool/tarantool that referenced this issue Apr 27, 2019
When libcurl is built with --enable-threaded-resolver (which is default)
and the version of the library is 7.60 or above, libcurl calls a timer
callback with exponentially increasing timeout_ms value during DNS
resolving.

This behaviour was introduced in curl-7_59_0-36-g67636222f (see [1],
[2]). During first ten milliseconds the library sets a timer to a passed
time divided by three (see Curl_resolver_getsock()). It is possible that
passed time is zero during at least several thousands of iterations.

Before this commit we didn't set a libev timer in curl_multi_timer_cb()
when a timeout_ms value is zero, but call curl_multi_process()
immediately. Libcurl however can call curl_multi_timer_cb() again and
here we're going into a recursion that stops only when timeous_ms
becomes positive. Often we generate several thousands of stack frames
within this recursion and exceed 512KiB of a fiber stack size.

The fix is easy: always set a libev timer to call curl_multi_process()
even when a timeout_ms value is zero.

The reason why we did the call to curl_multi_process() immediately is
the unclear wording in the CURLMOPT_TIMERFUNCTION option documentation.
This documentation page was fixed in curl-7_64_0-88-g47e540df8 (see [3],
[4], [5]).

There is also the related change in curl-7_60_0-121-g3ef67c686 (see [6],
[7]): after this commit libcurl calls a timer callback with zero
timeout_ms during a first three milliseconds of asynchronous DNS
resolving.

Fixes #4179.

[1]: curl/curl#2419
[2]: curl/curl@6763622
[3]: curl/curl#3537
[4]: curl/curl#3601
[5]: curl/curl@47e540d
[6]: curl/curl#2685
[7]: curl/curl@3ef67c6
Totktonada added a commit to tarantool/tarantool that referenced this issue Apr 28, 2019
When libcurl is built with --enable-threaded-resolver (which is default)
and the version of the library is 7.60 or above, libcurl calls a timer
callback with exponentially increasing timeout_ms value during DNS
resolving.

This behaviour was introduced in curl-7_59_0-36-g67636222f (see [1],
[2]). During first ten milliseconds the library sets a timer to a passed
time divided by three (see Curl_resolver_getsock()). It is possible that
passed time is zero during at least several thousands of iterations.

Before this commit we didn't set a libev timer in curl_multi_timer_cb()
when a timeout_ms value is zero, but call curl_multi_process()
immediately. Libcurl however can call curl_multi_timer_cb() again and
here we're going into a recursion that stops only when timeous_ms
becomes positive. Often we generate several thousands of stack frames
within this recursion and exceed 512KiB of a fiber stack size.

The fix is easy: set a libev timer to call curl_multi_process() even
when a timeout_ms value is zero.

The reason why we did the call to curl_multi_process() immediately is
the unclear wording in the CURLMOPT_TIMERFUNCTION option documentation.
This documentation page was fixed in curl-7_64_0-88-g47e540df8 (see [3],
[4], [5]).

There is also the related change in curl-7_60_0-121-g3ef67c686 (see [6],
[7]): after this commit libcurl calls a timer callback with zero
timeout_ms during a first three milliseconds of asynchronous DNS
resolving.

Fixes #4179.

[1]: curl/curl#2419
[2]: curl/curl@6763622
[3]: curl/curl#3537
[4]: curl/curl#3601
[5]: curl/curl@47e540d
[6]: curl/curl#2685
[7]: curl/curl@3ef67c6
locker pushed a commit to tarantool/tarantool that referenced this issue Apr 29, 2019
When libcurl is built with --enable-threaded-resolver (which is default)
and the version of the library is 7.60 or above, libcurl calls a timer
callback with exponentially increasing timeout_ms value during DNS
resolving.

This behaviour was introduced in curl-7_59_0-36-g67636222f (see [1],
[2]). During first ten milliseconds the library sets a timer to a passed
time divided by three (see Curl_resolver_getsock()). It is possible that
passed time is zero during at least several thousands of iterations.

Before this commit we didn't set a libev timer in curl_multi_timer_cb()
when a timeout_ms value is zero, but call curl_multi_process()
immediately. Libcurl however can call curl_multi_timer_cb() again and
here we're going into a recursion that stops only when timeous_ms
becomes positive. Often we generate several thousands of stack frames
within this recursion and exceed 512KiB of a fiber stack size.

The fix is easy: set a libev timer to call curl_multi_process() even
when a timeout_ms value is zero.

The reason why we did the call to curl_multi_process() immediately is
the unclear wording in the CURLMOPT_TIMERFUNCTION option documentation.
This documentation page was fixed in curl-7_64_0-88-g47e540df8 (see [3],
[4], [5]).

There is also the related change in curl-7_60_0-121-g3ef67c686 (see [6],
[7]): after this commit libcurl calls a timer callback with zero
timeout_ms during a first three milliseconds of asynchronous DNS
resolving.

Fixes #4179.

[1]: curl/curl#2419
[2]: curl/curl@6763622
[3]: curl/curl#3537
[4]: curl/curl#3601
[5]: curl/curl@47e540d
[6]: curl/curl#2685
[7]: curl/curl@3ef67c6
locker pushed a commit to tarantool/tarantool that referenced this issue Apr 29, 2019
When libcurl is built with --enable-threaded-resolver (which is default)
and the version of the library is 7.60 or above, libcurl calls a timer
callback with exponentially increasing timeout_ms value during DNS
resolving.

This behaviour was introduced in curl-7_59_0-36-g67636222f (see [1],
[2]). During first ten milliseconds the library sets a timer to a passed
time divided by three (see Curl_resolver_getsock()). It is possible that
passed time is zero during at least several thousands of iterations.

Before this commit we didn't set a libev timer in curl_multi_timer_cb()
when a timeout_ms value is zero, but call curl_multi_process()
immediately. Libcurl however can call curl_multi_timer_cb() again and
here we're going into a recursion that stops only when timeous_ms
becomes positive. Often we generate several thousands of stack frames
within this recursion and exceed 512KiB of a fiber stack size.

The fix is easy: set a libev timer to call curl_multi_process() even
when a timeout_ms value is zero.

The reason why we did the call to curl_multi_process() immediately is
the unclear wording in the CURLMOPT_TIMERFUNCTION option documentation.
This documentation page was fixed in curl-7_64_0-88-g47e540df8 (see [3],
[4], [5]).

There is also the related change in curl-7_60_0-121-g3ef67c686 (see [6],
[7]): after this commit libcurl calls a timer callback with zero
timeout_ms during a first three milliseconds of asynchronous DNS
resolving.

Fixes #4179.

[1]: curl/curl#2419
[2]: curl/curl@6763622
[3]: curl/curl#3537
[4]: curl/curl#3601
[5]: curl/curl@47e540d
[6]: curl/curl#2685
[7]: curl/curl@3ef67c6

(cherry picked from commit 47bd51b)
locker pushed a commit to tarantool/tarantool that referenced this issue Apr 29, 2019
When libcurl is built with --enable-threaded-resolver (which is default)
and the version of the library is 7.60 or above, libcurl calls a timer
callback with exponentially increasing timeout_ms value during DNS
resolving.

This behaviour was introduced in curl-7_59_0-36-g67636222f (see [1],
[2]). During first ten milliseconds the library sets a timer to a passed
time divided by three (see Curl_resolver_getsock()). It is possible that
passed time is zero during at least several thousands of iterations.

Before this commit we didn't set a libev timer in curl_multi_timer_cb()
when a timeout_ms value is zero, but call curl_multi_process()
immediately. Libcurl however can call curl_multi_timer_cb() again and
here we're going into a recursion that stops only when timeous_ms
becomes positive. Often we generate several thousands of stack frames
within this recursion and exceed 512KiB of a fiber stack size.

The fix is easy: set a libev timer to call curl_multi_process() even
when a timeout_ms value is zero.

The reason why we did the call to curl_multi_process() immediately is
the unclear wording in the CURLMOPT_TIMERFUNCTION option documentation.
This documentation page was fixed in curl-7_64_0-88-g47e540df8 (see [3],
[4], [5]).

There is also the related change in curl-7_60_0-121-g3ef67c686 (see [6],
[7]): after this commit libcurl calls a timer callback with zero
timeout_ms during a first three milliseconds of asynchronous DNS
resolving.

Fixes #4179.

[1]: curl/curl#2419
[2]: curl/curl@6763622
[3]: curl/curl#3537
[4]: curl/curl#3601
[5]: curl/curl@47e540d
[6]: curl/curl#2685
[7]: curl/curl@3ef67c6

(cherry picked from commit 47bd51b)
@rgoomes
Copy link
Author

rgoomes commented May 2, 2019

@bagder are you aware of the analysis in issue tarantool/tarantool#4179 ? According to the issue, this commit #2419 seems to be the culprit of this behavior. Can you please confirm? Full analysis is in tarantool/tarantool@70821c6

@bagder
Copy link
Member

bagder commented May 3, 2019

Yes, that seems to be entirely correct and a fine analysis. The reason is quite simply the "optmization" I did in #2419 that now makes it possible to call the timer function again with a zero millisecond timeout and for many systems calling libcurl back again recursively will then risk this kind of stack smash.

I can't figure out a sensible or easy fix for this in another way than the way we already discussed and have documented.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 1, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Development

No branches or pull requests

4 participants