Test 1564 failing intermittently on illumos #5037

Closed
citrus-it opened this issue Mar 4, 2020 · 14 comments

@citrus-it

I did this

Upgrading to curl 7.69.0 on OmniOS (an illumos distribution)

I expected the following

No new failing tests. The curl_multi_poll test (1564) is failing around 7/10 of the time with:

lib1564.c:128 curl_multi_poll returned too early

always in the block which is disabled for Windows - I wonder if illumos/Solaris has the same asynchronous socketpair feature?

operating system

OmniOS bloody

SunOS bloody 5.11 omnios-master-b463bb4090 i86pc i386 i86pc
@bagder bagder added the tests label Mar 5, 2020
@bagder
Member

bagder commented Mar 5, 2020

You are saying that this failure is new in 7.69.0 though, and that it didn't do this in 7.68.0?

If so, then it smells like perhaps #5019 caused this regression?

@citrus-it
Author

You are saying that this failure is new in 7.69.0 though, and that it didn't do this in 7.68.0?

Yes, this is a new failure with 7.69.0. I'll try reverting that change and see if the test starts working again.

@citrus-it
Author

It looks like the same test failure is there in 7.68.0 too; I just didn't see it when I ran the test suite during that upgrade. Just (un)lucky, I suppose.

@bagder
Member

bagder commented Mar 5, 2020

Ah, ok. Then the explanation isn't the commit mentioned above at least...

@bagder
Member

bagder commented Mar 5, 2020

@citrus-it any chance you can debug it?

The test case first calls curl_multi_wakeup() WAKEUP_NUM (1234567) times. Each call writes a single byte to the socketpair.

Then the test calls curl_multi_poll() to verify that it aborts very quickly (due to the pending wakeups).

Then it calls curl_multi_poll() again, with no wakeup in between. The previous call should have drained the socketpair pipe, so this call should now wait for the entire timeout period (1000 ms). If it doesn't, it means that a previous wakeup somehow still made it return early!
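
For reference, a minimal sketch of that test logic (not the exact lib1564.c source; it omits the easy handle setup, and the 500 ms "too early" threshold here is illustrative):

#include <stdio.h>
#include <sys/time.h>
#include <curl/curl.h>

#define WAKEUP_NUM 1234567

int main(void)
{
  CURLM *multi;
  struct timeval before, after;
  long elapsed_ms;
  int numfds;
  int i;

  curl_global_init(CURL_GLOBAL_ALL);
  multi = curl_multi_init();

  /* each curl_multi_wakeup() writes a single byte to the socketpair */
  for(i = 0; i < WAKEUP_NUM; i++)
    curl_multi_wakeup(multi);

  /* first poll: expected to return almost at once because of the pending
     wakeups, draining the socketpair pipe as a side effect */
  curl_multi_poll(multi, NULL, 0, 1000, &numfds);

  /* second poll: no wakeup since the drain, so it should block for the
     full 1000 ms; returning early means leftover bytes in the pipe */
  gettimeofday(&before, NULL);
  curl_multi_poll(multi, NULL, 0, 1000, &numfds);
  gettimeofday(&after, NULL);

  elapsed_ms = (after.tv_sec - before.tv_sec) * 1000 +
               (after.tv_usec - before.tv_usec) / 1000;
  if(elapsed_ms < 500)
    fprintf(stderr, "curl_multi_poll returned too early\n");

  curl_multi_cleanup(multi);
  curl_global_cleanup();
  return 0;
}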

@citrus-it
Author

citrus-it commented Mar 5, 2020

Yes, I can look into it over the next week and get back to you.

@citrus-it
Author

I did a bit of digging last night and it looks like the recv() call on the socketpair can return -1/EAGAIN when it is not fully drained, i.e. a subsequent recv() call on the socket will return more data.

I added some sread() calls after the recv() loop (sketched further down) and I see this when the test works:

% ./lib1564 http://xxx
ERR: -1, 11, Resource temporarily unavailable   <- this is why the recv(64) loop has ended
CHECKING: -1                                    <- then I do more calls to sread() which all return -1
CHECKING: -1
CHECKING: -1
CHECKING: -1
CHECKING: -1
CHECKING: -1
err: 11 (Resource temporarily unavailable)

but when the test fails:

ERR: -1, 11, Resource temporarily unavailable   <- again, why the recv(64) loop has ended
CHECKING: 64                                    <- subsequent sread() calls return data.
CHECKING: 43
CHECKING: 23
CHECKING: 20
CHECKING: 16
CHECKING: 18
err: 0 (Error 0)
lib1564.c:132 curl_multi_poll returned too early   <- I've slimmed the test down, hence the different line no.
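
The probing amounts to something like this sketch (probe_after_drain and fd are illustrative names; fd stands for the read end of curl's internal wakeup socketpair, and this is not the actual lib1564.c or lib/multi.c code):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

static void probe_after_drain(int fd)
{
  char buf[64];
  ssize_t nread;
  int i;

  /* the normal drain loop ends when recv() fails with EAGAIN */
  while((nread = recv(fd, buf, sizeof(buf), 0)) > 0)
    ;
  fprintf(stderr, "ERR: %zd, %d, %s\n", nread, errno, strerror(errno));

  /* extra reads: on illumos these can still return data even though the
     loop above already hit EAGAIN */
  for(i = 0; i < 6; i++) {
    nread = recv(fd, buf, sizeof(buf), 0);
    fprintf(stderr, "CHECKING: %zd\n", nread);
  }
}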

Interestingly, even in the loop not all calls to recv() are returning 64 bytes. Most return between 16 and 31.

% pfexec dtrace -n 'pid$target::recv:return/arg1 != 4294967295/{@ = quantize(arg1)}' -c './lib1564 http://www.fiddaman.net'
dtrace: description 'pid$target::recv:return' matched 1 probe
URL: http://www.fiddaman.net
lib1564.c:132 curl_multi_poll returned too early
dtrace: pid 10881 has exited


           value  ------------- Distribution ------------- count
               4 |                                         0
               8 |                                         2
              16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           510
              32 |@@@                                      43
              64 |@@@@@@@@                                 132
             128 |                                         0

@bagder
Member

bagder commented Mar 6, 2020

Clearly, on this platform an excessive number of previous wakeup calls can linger in the socketpair pipe and cancel future curl_multi_poll() calls in a rather unexpected way...

@citrus-it
Author

Yes, it appears that EAGAIN indicates that no data is available just now, but not that the pipe is necessarily empty. It's as if there is some latency there. I will ask some other developers about this.
(illumos is derived from OpenSolaris, so this problem could also show up on Solaris.)

@citrus-it
Author

I wrote a small program to test this behaviour and tried it on a few different platforms that I have to hand.

On OmniOS, OpenIndiana and Solaris the read() loop does not always fully drain the pipe.
On FreeBSD and macOS (Catalina) it does.

I assume from the comment in the code that Windows may be in the first category too.
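
A minimal reconstruction of that kind of probe (a sketch of the idea only, not the actual program, which is attached further down as socketpair.c.txt; the byte count comes from argv[1]):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int sv[2];
  long target = (argc > 1) ? atol(argv[1]) : 10;
  long wrote = 0, got = 0;
  char buf[64];
  ssize_t n;
  long i;

  if(socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
    perror("socketpair");
    return 1;
  }
  /* non-blocking on both ends, matching curl's wakeup socketpair */
  fcntl(sv[0], F_SETFL, O_NONBLOCK);
  fcntl(sv[1], F_SETFL, O_NONBLOCK);

  /* write single bytes, like curl_multi_wakeup() does */
  for(i = 0; i < target; i++)
    if(write(sv[1], "W", 1) == 1)
      wrote++;
  printf("Wrote %ld\n", wrote);

  /* drain until read() fails with EAGAIN */
  while((n = read(sv[0], buf, sizeof(buf))) > 0)
    got += n;
  printf("Read %ld (%s)\n", got, strerror(errno));

  /* if EAGAIN really meant "empty", every written byte was read back */
  puts(got == wrote ? "SUCCESS" : "FAIL");
  return 0;
}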

@bagder
Member

bagder commented Mar 7, 2020

Thanks for this excellent input and data. So maybe we can change the test to just not be that excessive? Does the test work if you change WAKEUP_NUM at the top of it to just 10 or so?

We could possibly even consider changing the documentation to say something about these new findings, but it also seems like a rather extreme edge case for someone to call wakeup() on the handle this many times without it even being "active".

@citrus-it
Author

citrus-it commented Mar 7, 2020

I did try with 8192 writes, since I noticed that was the number successfully written on FreeBSD and macOS before write() started returning EAGAIN, whereas on the other platforms that number is 21504.

I just ran the attached test program 100 times in a loop like this:

build% for i in `seq 1 100`; do
./test 10
done | sort | uniq -c
 100 Read 10 (Resource temporarily unavailable)
 100 SUCCESS
 100 Wrote 10

It was also fine with 512, 1024, 2048 and 4096.

With 8192, I got:

  27 FAIL
  73 SUCCESS

socketpair.c.txt

So, yes, I don't think this is a problem with normal operation of the library. The test could be less aggressive :)

bagder added a commit that referenced this issue Mar 7, 2020
This test does A LOT of *wakeup() calls and then calls curl_multi_poll()
twice. The first *poll() is then expected to return early and the second
not - as the first is supposed to drain the socketpair pipe.

It turns out however that when given "excessive" amounts of writes to
the pipe, some operating systems (the Solaris-based ones are known) will
return EAGAIN before the pipe is drained, which in our test case causes
the second *poll() call to also abort early.

This change attempts to avoid the OS-specific behaviors in the test by
reducing the amount of wakeup calls from 1234567 to 10.

Reported-by: Andy Fiddaman
Fixes #5037
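
The substance of the fix is a one-line change at the top of lib1564.c (a sketch of the effect of the commit, not the literal diff):

#define WAKEUP_NUM 10 /* was 1234567: stay well within every platform's
                         socketpair buffering */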
@bagder
Member

bagder commented Mar 7, 2020

@citrus-it, you good with the suggested PR?

@citrus-it
Author

Yes, looks good to me, thanks.

@bagder bagder closed this as completed in 0a04dc4 Mar 7, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020