Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test 1238 causing shortly following tests 1242 and 1243 to fail #5364

Closed
wants to merge 1 commit into from

Conversation

mback2k
Copy link
Member

@mback2k mback2k commented May 9, 2020

At the moment tests 1242 and 1243 fail if the TFTP server is still busy with test 1238.

Therefore the TFTP server will now be killed after test 1238. This PR also includes other TFTP fixes.

@mback2k mback2k requested a review from bagder May 9, 2020 16:16
@mback2k mback2k self-assigned this May 9, 2020
@mback2k
Copy link
Member Author

mback2k commented May 9, 2020

The problem can be observed on AppVeyor (Windows) and CirrusCI (FreeBSD):
https://ci.appveyor.com/project/curlorg/curl/builds/32734281/job/79ysyh9b93vbhwf7?fullLog=true#L4305
https://cirrus-ci.com/task/4549026190196736?command=test#L2213

tests/server/tftpd.c Outdated Show resolved Hide resolved
tests/data/test1238 Show resolved Hide resolved
@Lekensteyn
Copy link
Contributor

So the hypothesis is that the server spends more time on the test than the client.

The relevant tftpd server code can be found here:

curl/tests/server/tftpd.c

Lines 1167 to 1215 in 54fa68b

do {
size = readit(test, (struct tftphdr **)&sdp, pf->f_convert);
if(size < 0) {
nak(errno + 100);
return;
}
sdp->th_opcode = htons((unsigned short)opcode_DATA);
sdp->th_block = htons(sendblock);
timeout = 0;
#ifdef HAVE_SIGSETJMP
(void) sigsetjmp(timeoutbuf, 1);
#endif
if(test->writedelay) {
logmsg("Pausing %d seconds before %d bytes", test->writedelay,
size);
wait_ms(1000*test->writedelay);
}
send_data:
if(swrite(peer, sdp, size + 4) != size + 4) {
logmsg("write");
return;
}
read_ahead(test, pf->f_convert);
for(;;) {
#ifdef HAVE_ALARM
alarm(rexmtval); /* read the ack */
#endif
n = sread(peer, &ackbuf.storage[0], sizeof(ackbuf.storage));
#ifdef HAVE_ALARM
alarm(0);
#endif
if(got_exit_signal)
return;
if(n < 0) {
logmsg("read: fail");
return;
}
sap->th_opcode = ntohs((unsigned short)sap->th_opcode);
sap->th_block = ntohs(sap->th_block);
if(sap->th_opcode == opcode_ERROR) {
logmsg("got ERROR");
return;
}
if(sap->th_opcode == opcode_ACK) {
if(sap->th_block == sendblock) {
break;

Based on an analysis of the logs, the server code and a packet capture, the test seems misconfigured. It has one iteration too much.

output for: ./runtests.pl -n -p -v 1238
startnew: perl -I/tmp/curl/tests /tmp/curl/tests/tftpserver.pl --pidfile ".tftp_server.pid" --portfile ".tftp_server.port" --logfile "log/tftp_server.log" --ipv4 --port 0 --srcdir "/tmp/curl/tests"
RUN: ../src/curl --max-time 13 --silent --verbose --globoff "tftp://127.0.0.1:33042/verifiedserver" 2>log/tftp_verify.log
RUN: Verifying our test tftp server took 0 seconds
RUN: TFTP server on PID 689417 port 33042
* pid tftp => 689417 689417
test 1238...[slow TFTP retrieve cancel due to -Y and -y]
"../src/curl" --output log/curl1238.out  --include --trace-ascii log/trace1238 --trace-time tftp://127.0.0.1:33042//1238 -Y1000 -y2 >log/stdout1238 2>log/stderr1238
CMD (7168): "../src/curl" --output log/curl1238.out  --include --trace-ascii log/trace1238 --trace-time tftp://127.0.0.1:33042//1238 -Y1000 -y2 >log/stdout1238 2>log/stderr1238
--p----e--- OK (1   out of 1  , remaining: 00:00, took 5.077s, duration: 00:05)
* kill pid for tftp => 689417
RUN: Process with pid 689417 signalled to die
RUN: Process with pid 689417 forced to die with SIGKILL
TESTDONE: 1 tests out of 1 reported OK: 100%
TESTDONE: 1 tests were considered during 5 seconds.
log/tftp_server.log
22:45:34.627366 Wrote pid 690179 to .tftp_server.pid
22:45:34.627558 Wrote port 56190 to .tftp_server.port
22:45:34.627587 Running IPv4 version on port UDP/56190
22:45:35.630455 trying to get file: verifiedserver mode 1
22:45:35.630495 Are-we-friendly question received
22:45:35.630753 end of one transfer
22:45:35.662002 trying to get file: /1238 mode 1
22:45:35.662041 requested test number 1238 part 0
22:45:35.662188 instructed to delay 2 secs between packets
22:45:35.662330 file opened and all is good
22:45:35.662380 Pausing 2 seconds before 512 bytes
22:45:37.665474 Pausing 2 seconds before 512 bytes
22:45:39.667913 read: fail
22:45:39.668125 end of one transfer
22:45:39.684584 exit_signal_handler: 15
22:45:39.684817 signalled to die
22:45:39.684863 ========> IPv4 tftpd (port: 56190 pid: 690179) exits with signal (15)

During the test I ran dumpcap -i lo -p -w tftp.pcapng to capture this result. Afterwards I analyzed it:

Packet capture summary

33042 is taken from the debug output above. The columns shown are: 1) time since begin of the capture, 2) unique ip/port tuple identifier, 3) UDP source port, 4) UDP destination port, 5) extra info.

$ tshark -r tftp.pcapng -dudp.port==33042,tftp -Y udp -Tfields -eframe.time_relative -eudp.stream -eudp.srcport -eudp.dstport -e_ws.col.Info
0.000503661     0       52352   33042   Read Request, File: verifiedserver, Transfer type: octet, tsize=0, blksize=512, timeout=4
0.000750454     1       50745   52352   Data Packet, Block: 1 (last)
0.000878674     1       52352   50745   Acknowledgement, Block: 1
0.028368623     2       59237   33042   Read Request, File: /1238, Transfer type: octet, tsize=0, blksize=512, timeout=6
2.031347308     3       51183   59237   Data Packet, Block: 1
2.032122733     3       59237   51183   Acknowledgement, Block: 1
4.034639911     3       51183   59237   Data Packet, Block: 2
4.034675245     3       51183   59237   Destination unreachable (Port unreachable)
  • The first three packets (streams 0 and 1) are the "Are-we-friendly" test to check whether the TFTP server works.
  • The next packet (stream 2) is the client request. (Step "1. The initiating host ..." from https://en.wikipedia.org/wiki/Trivial_File_Transfer_Protocol). The next responses will use a new ephemeral source port and end up as stream 3.
  • Client sends Read Request, in the code sendtftp will be called.
  • Round 1 (server calls readit to obtain the data for the client.)
  • The server sleeps for two seconds before responding (wait_ms).
  • The Data Packet (Block 1) is returned (swrite - socket write).
  • The server reads an ACK message (sread). Loop is terminated.
  • Round 2 (server calls readit to obtain the data for the client.)
  • Sleep two seconds. Somewhere in this period, curl exits because -y2 was specified, and more than two seconds were waited for.
  • Data Block 2 is returned.
  • The server tries to read an ACK but fails. In the packet capture, one can see a ICMP message. In the TFTP server log, one can see "read: fail"
  • Round 3 (does not occur in this trace, but if it would, then another 2 second delay will be present here for the last block)

However if for whatever reason (timing, failure to deliver ICMP, etc.) the server receives an ACK, then the server will continue to read the next. In total the server could be spending about 6 seconds whereas the client fails within a second.

I believe that another way to fix this issue is to reduce the test data for test1238 from 1200 bytes (3 blocks) to two blocks (e.g. 600 bytes). Assuming that -y2 covers the full transfer and curl checks often enough, this should reduce the server time to about four seconds. This would probably still prevent multiple TFTP tests from running straight after each other when "whatever reason" happens, but it should hopefully reduce the probability of failure.

@mback2k
Copy link
Member Author

mback2k commented May 10, 2020

So the hypothesis is that the server spends more time on the test than the client.

Yes, but I would not call it a hypothesis since I observed it on 2 CIs and in my local tests.

  • Data Block 2 is returned.
  • The server tries to read an ACK but fails. In the packet capture, one can see a ICMP message. In the TFTP server log, one can see "read: fail"

That "read: fail" does not happen on FreeBSD or Windows. The server is stuck reading and will timeout after 5 tries of 5 second waits using the alarm functionality instead of read timeouts.

However if for whatever reason (timing, failure to deliver ICMP, etc.) the server receives an ACK, then the server will continue to read the next. In total the server could be spending about 6 seconds whereas the client fails within a second.

Why would there be an ICMP deliver? curl just does not send the data via UDP packet anymore and the server is stuck waiting for that. Since there is no connection handling in UDP unlike TCP, the server will never get a "connection close event" from the client and read will not return 0.

I believe that another way to fix this issue is to reduce the test data for test1238 from 1200 bytes (3 blocks) to two blocks (e.g. 600 bytes). Assuming that -y2 covers the full transfer and curl checks often enough, this should reduce the server time to about four seconds.

Anyway, I will try your suggestion about reducing the transfer size. But are you sure that doesn't change the purpose of test 1238 timing out during the transfer instead of after the transfer has already finished and reached the server?

mback2k added a commit that referenced this pull request May 10, 2020
setjmp.h should only be included if HAVE_SETJMP_H is defined.

Add additional log statements to see wether reads and writes
are blocking or finishing before an alarm signal is received.

Assisted-by: Peter Wu
Part of #5364
@mback2k
Copy link
Member Author

mback2k commented May 10, 2020

Just for reference here is the the output of tftp_server.log after applying the logging patch:

tail -F --max-unchanged-stats=1 /mnt/c/Users/Marc/OS/curl/tests/log/tftp_server.log
tail: '/mnt/c/Users/Marc/OS/curl/tests/log/tftp_server.log' has appeared;  following new file
18:03:40.549649 Wrote pid 41685 to .tftp_server.pid
18:03:40.557604 Wrote port 61721 to .tftp_server.port
18:03:40.561203 Running IPv4 version on port UDP/61721
18:03:41.497863 trying to get file: verifiedserver mode 1
18:03:41.502108 Are-we-friendly question received
18:03:41.506101 write
18:03:41.510033 read
18:03:41.514265 read: 4
18:03:41.518265 end of one transfer
18:03:41.592375 trying to get file: /1238 mode 1
18:03:41.596370 requested test number 1238 part 0
18:03:41.603990 instructed to delay 2 secs between packets
18:03:41.609026 file opened and all is good
18:03:41.613141 Pausing 2 seconds before 512 bytes
18:03:43.617251 write
18:03:43.622778 read
18:03:43.627763 read: 4
18:03:43.632154 Pausing 2 seconds before 512 bytes
18:03:45.637263 write
18:03:45.642776 read
tail: '/mnt/c/Users/Marc/OS/curl/tests/log/tftp_server.log' has become inaccessible: No such file or directory
tail: '/mnt/c/Users/Marc/OS/curl/tests/log/tftp_server.log' has appeared;  following new file
18:03:50.642969 alarm!
18:03:50.649066 Pausing 2 seconds before 512 bytes
18:03:52.653897 write
18:03:52.659270 read
18:03:57.659304 alarm!
18:03:57.664369 Pausing 2 seconds before 512 bytes
18:03:59.668367 write
18:03:59.674096 read
18:04:04.673566 alarm!
18:04:04.678575 Pausing 2 seconds before 512 bytes
18:04:06.683427 write
18:04:06.688595 read
18:04:11.688193 alarm!
18:04:11.693438 Pausing 2 seconds before 512 bytes
18:04:13.697898 write
18:04:13.703173 read
18:04:18.703823 alarm!
18:04:18.709713 Error removing lock file log/serverlogs.lock error: 2 No such file or directory

As you can see curl is timing out during the 2nd block (as you described with Round 2), but there is no "read: fail" during test 1238 which can lead to tftp still waiting for 1238 when 1242 starts.

@mback2k
Copy link
Member Author

mback2k commented May 10, 2020

I believe that another way to fix this issue is to reduce the test data for test1238 from 1200 bytes (3 blocks) to two blocks (e.g. 600 bytes). Assuming that -y2 covers the full transfer and curl checks often enough, this should reduce the server time to about four seconds.

Anyway, I will try your suggestion about reducing the transfer size. But are you sure that doesn't change the purpose of test 1238 timing out during the transfer instead of after the transfer has already finished and reached the server?

I have now also tried your proposal and this does not help with the TFTP server still being busy:

tail: '/mnt/c/Users/Marc/OS/curl/tests/log/tftp_server.log' has appeared;  following new file
18:17:42.359598 Wrote pid 41757 to .tftp_server.pid
18:17:42.367565 Wrote port 54072 to .tftp_server.port
18:17:42.371288 Running IPv4 version on port UDP/54072
18:17:43.310802 trying to get file: verifiedserver mode 1
18:17:43.314363 Are-we-friendly question received
18:17:43.317974 write
18:17:43.321870 read
18:17:43.325764 read: 4
18:17:43.330793 end of one transfer
18:17:43.410485 trying to get file: /1238 mode 1
18:17:43.414669 requested test number 1238 part 0
18:17:43.423897 instructed to delay 2 secs between packets
18:17:43.428561 file opened and all is good
18:17:43.432963 Pausing 2 seconds before 512 bytes
18:17:45.438557 write
18:17:45.444257 read
18:17:45.449195 read: 4
18:17:45.453732 Pausing 2 seconds before 89 bytes
18:17:47.458586 write
18:17:47.463917 read
tail: '/mnt/c/Users/Marc/OS/curl/tests/log/tftp_server.log' has become inaccessible: No such file or directory
tail: '/mnt/c/Users/Marc/OS/curl/tests/log/tftp_server.log' has appeared;  following new file
18:17:52.463753 alarm!
18:17:52.468913 Pausing 2 seconds before 89 bytes
18:17:54.473760 write
18:17:54.478885 read
18:17:59.478894 alarm!
18:17:59.483911 Pausing 2 seconds before 89 bytes
18:18:01.488597 write
18:18:01.493980 read
18:18:06.493827 alarm!
18:18:06.498988 Pausing 2 seconds before 89 bytes
18:18:08.502931 write
18:18:08.508281 read
18:18:13.507731 alarm!
18:18:13.512914 Pausing 2 seconds before 89 bytes
18:18:15.517400 write
18:18:15.522630 read
18:18:20.522149 alarm!
18:18:20.528441 Error removing lock file log/serverlogs.lock error: 2 No such file or directory

Also the test explicitly states the size of 1200 as part of the test behavior:

# ~1200 bytes (so that they don't fit in two 512 byte chunks)

Therefore I would propose my original patch with an additional comment about the reason why killing the server is required in order to never affect other afterwards following TFTP tests.

The tftpd server may still be busy if the total timeout of
25 seconds has not been reached or no sread error was received
during or after the execution of the timeout test 1238.

Once the next TFTP test comes around (eg. 1242 or 1243),
those will fail because the tftpd server is still waiting
on data from curl due to the UDP protocol being stateless
and having no connection close. On Linux this error may not
happen, because ICMP errors generated due to a swrite error
can also be returned async on the next sread call instead.

Therefore we will now just kill the tftpd server after test
1238 to make sure that the following tests are not affected.

This enables us to no longer ignore tests 1242, 1243, 2002
and 2003 on the CI platforms CirrusCI and AppVeyor.

Assisted-by: Peter Wu
Closes curl#5364
@mback2k
Copy link
Member Author

mback2k commented May 10, 2020

More background why you may be seeing a "read: fail" on Linux while that is not related to the read itself, but the write instead:
https://stackoverflow.com/a/5824155/834537
http://man7.org/linux/man-pages/man7/udp.7.html
https://tools.ietf.org/html/rfc1122#page-78

As far as I can tell FreeBSD and Windows do not behave like that. Also see comments here: 000f721.

Why would there be an ICMP deliver? curl just does not send the data via UDP packet anymore and the server is stuck waiting for that. Since there is no connection handling in UDP unlike TCP, the server will never get a "connection close event" from the client and read will not return 0.

Regarding my quoted above comment of course, I now understand that you most likely refer to the ICMP error messages regarding send and not read failure according to what I described above.

@mback2k mback2k requested a review from Lekensteyn May 10, 2020 16:58
Copy link
Contributor

@Lekensteyn Lekensteyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing, LGTM!

@mback2k
Copy link
Member Author

mback2k commented May 10, 2020

@Lekensteyn thanks @bagder what do you think?

@mback2k mback2k closed this in cffbcc3 May 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

2 participants