Test 1564 failing intermittently on illumos #5037

Closed
citrus-it opened this issue Mar 4, 2020 · 14 comments

@citrus-it

I did this

Upgrading to curl 7.69.0 on OmniOS (an illumos distribution)

I expected the following

No new failing tests. The curl_multi_poll test (1564) is failing around 7/10 of the time with:

lib1564.c:128 curl_multi_poll returned too early

always in the block which is disabled for Windows - I wonder if illumos/Solaris has the same asynchronous socketpair feature?

operating system

OmniOS bloody

SunOS bloody 5.11 omnios-master-b463bb4090 i86pc i386 i86pc
@bagder bagder added the tests label Mar 5, 2020
@bagder
Member

bagder commented Mar 5, 2020

You are saying that this failure is new in 7.69.0 though, and that it didn't do this in 7.68.0?

If so, then it smells like perhaps #5019 caused this regression?

@citrus-it
Author

You are saying that this failure is new in 7.69.0 though, and that it didn't do this in 7.68.0?

Yes, this is a new failure with 7.69.0. I'll try reverting that change and see if the test starts working again.

@citrus-it
Author

It looks like the same test failure is there in 7.68.0 too; I just didn't see it when I ran the test suite during that upgrade. Just (un)lucky, I suppose.

@bagder
Member

bagder commented Mar 5, 2020

Ah, ok. Then the explanation isn't the commit mentioned above at least...

@bagder
Member

bagder commented Mar 5, 2020

@citrus-it any chance you can debug it?

The test case first calls curl_multi_wakeup() WAKEUP_NUM (1234567) times. Each call writes a single byte to the socketpair.

Then the test calls curl_multi_poll() to verify that it aborts very quickly (due to the pending wakeups).

Then it calls curl_multi_poll() again, with no wakeup in between. The previous call should have drained the socketpair pipe, so this call should now wait for the entire timeout period (1000 ms). If it doesn't, it means that a previous wakeup somehow still made it return early!
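
For reference, a minimal sketch of that test logic (not the exact lib1564.c source; it omits the easy handle setup, and the 500 ms "too early" threshold here is illustrative):

#include <stdio.h>
#include <sys/time.h>
#include <curl/curl.h>

#define WAKEUP_NUM 1234567

int main(void)
{
  CURLM *multi;
  struct timeval before, after;
  long elapsed_ms;
  int numfds;
  int i;

  curl_global_init(CURL_GLOBAL_ALL);
  multi = curl_multi_init();

  /* each curl_multi_wakeup() writes a single byte to the socketpair */
  for(i = 0; i < WAKEUP_NUM; i++)
    curl_multi_wakeup(multi);

  /* first poll: expected to return almost at once because of the pending
     wakeups, draining the socketpair pipe as a side effect */
  curl_multi_poll(multi, NULL, 0, 1000, &numfds);

  /* second poll: no wakeup since the drain, so it should block for the
     full 1000 ms; returning early means leftover bytes in the pipe */
  gettimeofday(&before, NULL);
  curl_multi_poll(multi, NULL, 0, 1000, &numfds);
  gettimeofday(&after, NULL);

  elapsed_ms = (after.tv_sec - before.tv_sec) * 1000 +
               (after.tv_usec - before.tv_usec) / 1000;
  if(elapsed_ms < 500)
    fprintf(stderr, "curl_multi_poll returned too early\n");

  curl_multi_cleanup(multi);
  curl_global_cleanup();
  return 0;
}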

@citrus-it
Author

citrus-it commented Mar 5, 2020

Yes, I can look into it over the next week and get back to you.

@citrus-it
Author

I did a bit of digging last night and it looks like the recv() call on the socketpair can return -1/EAGAIN when it is not fully drained, i.e. a subsequent recv() call on the socket will return more data.

I added some sread() calls after the recv() loop (sketched further down) and I see this when the test works:

% ./lib1564 http://xxx
ERR: -1, 11, Resource temporarily unavailable   <- this is why the recv(64) loop has ended
CHECKING: -1                                    <- then I do more calls to sread() which all return -1
CHECKING: -1
CHECKING: -1
CHECKING: -1
CHECKING: -1
CHECKING: -1
err: 11 (Resource temporarily unavailable)

but when the test fails:

ERR: -1, 11, Resource temporarily unavailable   <- again, why the recv(64) loop has ended
CHECKING: 64                                    <- subsequent sread() calls return data.
CHECKING: 43
CHECKING: 23
CHECKING: 20
CHECKING: 16
CHECKING: 18
err: 0 (Error 0)
lib1564.c:132 curl_multi_poll returned too early   <- I've slimmed the test down, hence the different line no.
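
The probing amounts to something like this sketch (probe_after_drain and fd are illustrative names; fd stands for the read end of curl's internal wakeup socketpair, and this is not the actual lib1564.c or lib/multi.c code):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

static void probe_after_drain(int fd)
{
  char buf[64];
  ssize_t nread;
  int i;

  /* the normal drain loop ends when recv() fails with EAGAIN */
  while((nread = recv(fd, buf, sizeof(buf), 0)) > 0)
    ;
  fprintf(stderr, "ERR: %zd, %d, %s\n", nread, errno, strerror(errno));

  /* extra reads: on illumos these can still return data even though the
     loop above already hit EAGAIN */
  for(i = 0; i < 6; i++) {
    nread = recv(fd, buf, sizeof(buf), 0);
    fprintf(stderr, "CHECKING: %zd\n", nread);
  }
}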

Interestingly, even in the loop not all calls to recv() are returning 64 bytes. Most return between 16 and 31.

% pfexec dtrace -n 'pid$target::recv:return/arg1 != 4294967295/{@ = quantize(arg1)}' -c './lib1564 http://www.fiddaman.net'
dtrace: description 'pid$target::recv:return' matched 1 probe
URL: http://www.fiddaman.net
lib1564.c:132 curl_multi_poll returned too early
dtrace: pid 10881 has exited


           value  ------------- Distribution ------------- count
               4 |                                         0
               8 |                                         2
              16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           510
              32 |@@@                                      43
              64 |@@@@@@@@                                 132
             128 |                                         0

@bagder
Member

bagder commented Mar 6, 2020

Clearly, on this platform an excessive number of previous wakeup calls can linger in the socketpair pipe and cancel future curl_multi_poll() calls in a rather unexpected way...

@citrus-it
Author

Yes, it appears that EAGAIN indicates that no data is available just now, but not that the pipe is necessarily empty. It's as if there is some latency there. I will ask some other developers about this.
(illumos is derived from OpenSolaris, so this problem could also show up on Solaris.)

@citrus-it
Author

I wrote a small program to test this behaviour and tried it on a few different platforms that I have to hand.

On OmniOS, OpenIndiana and Solaris the read() loop does not always fully drain the pipe.
On FreeBSD and macOS (Catalina) it does.

I assume from the comment in the code that Windows may be in the first category too.
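
A minimal reconstruction of that kind of probe (a sketch of the idea only, not the actual program, which is attached further down as socketpair.c.txt; the byte count comes from argv[1]):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int sv[2];
  long target = (argc > 1) ? atol(argv[1]) : 10;
  long wrote = 0, got = 0;
  char buf[64];
  ssize_t n;
  long i;

  if(socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
    perror("socketpair");
    return 1;
  }
  /* non-blocking on both ends, matching curl's wakeup socketpair */
  fcntl(sv[0], F_SETFL, O_NONBLOCK);
  fcntl(sv[1], F_SETFL, O_NONBLOCK);

  /* write single bytes, like curl_multi_wakeup() does */
  for(i = 0; i < target; i++)
    if(write(sv[1], "W", 1) == 1)
      wrote++;
  printf("Wrote %ld\n", wrote);

  /* drain until read() fails with EAGAIN */
  while((n = read(sv[0], buf, sizeof(buf))) > 0)
    got += n;
  printf("Read %ld (%s)\n", got, strerror(errno));

  /* if EAGAIN really meant "empty", every written byte was read back */
  puts(got == wrote ? "SUCCESS" : "FAIL");
  return 0;
}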

@bagder
Member

bagder commented Mar 7, 2020

Thanks for this excellent input and data. So maybe we can change the test to just not be that excessive? Does the test work if you change WAKEUP_NUM at the top of it to just 10 or so?

We could possibly even consider changing the documentation to say something about these new findings, but it also seems like a rather extreme edge case for someone to call wakeup() on the handle this many times without it even being "active".

@citrus-it
Author

citrus-it commented Mar 7, 2020

I did try with 8192 writes, since I noticed that was the number successfully written on FreeBSD and macOS before write() started returning EAGAIN, whereas on the other platforms that number is 21504.

I just ran the attached test program 100 times in a loop like this:

build% for i in `seq 1 100`; do
./test 10
done | sort | uniq -c
 100 Read 10 (Resource temporarily unavailable)
 100 SUCCESS
 100 Wrote 10

It was also fine with 512, 1024, 2048 and 4096.

With 8192, I got:

  27 FAIL
  73 SUCCESS

socketpair.c.txt

So, yes, I don't think this is a problem with normal operation of the library. The test could be less aggressive :)

bagder added a commit that referenced this issue Mar 7, 2020
This test does A LOT of *wakeup() calls and then calls curl_multi_poll()
twice. The first *poll() is then expected to return early and the second
not - as the first is supposed to drain the socketpair pipe.

It turns out however that when given "excessive" amounts of writes to
the pipe, some operating systems (the Solaris-based ones are known) will
return EAGAIN before the pipe is drained, which in our test case causes
the second *poll() call to also abort early.

This change attempts to avoid the OS-specific behaviors in the test by
reducing the amount of wakeup calls from 1234567 to 10.

Reported-by: Andy Fiddaman
Fixes #5037
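
The substance of the fix is a one-line change at the top of lib1564.c (a sketch of the effect of the commit, not the literal diff):

#define WAKEUP_NUM 10 /* was 1234567: stay well within every platform's
                         socketpair buffering */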
@bagder
Member

bagder commented Mar 7, 2020

@citrus-it, you good with the suggested PR?

@citrus-it
Author

Yes, looks good to me, thanks.

@bagder bagder closed this as completed in 0a04dc4 Mar 7, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020