Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

content_encoding: restart zlib in raw mode only if still possible. #2068

Closed
wants to merge 7 commits into from
Closed

content_encoding: restart zlib in raw mode only if still possible. #2068

wants to merge 7 commits into from

Conversation

monnerat
Copy link
Contributor

If the initial data is not available anymore or if output has
already started, restarting in raw mode is not possible.

This commit also rephrase the inflate_stream() procedure and checks
for deflate/gzip unparsable trailing data bytes.

New test 232 checks the deflate algorithm in raw mode.

If the initial data is not available anymore or if output has
already started, restarting in raw mode is not possible.

This commit also rephrase the inflate_stream() procedure and checks
for deflate/gzip unparsable trailing data bytes.

New test 232 checks the deflate algorithm in raw mode.
break;
default:
result = exit_zlib(conn, z, &zp->zlib_init, CURLE_WRITE_ERROR);
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be easier to understand:

if(zp->zlib_init != ZLIB_INIT &&
   zp->zlib_init != ZLIB_INFLATING &&
   zp->zlib_init != ZLIB_INIT_GZIP &&
   zp->zlib_init != ZLIB_GZIP_INFLATING) {
  result = exit_zlib(conn, z, &zp->zlib_init, CURLE_WRITE_ERROR);
  break;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for the review.
Maybe you're right. At least we have the same effect.

result = exit_zlib(conn, z, &zp->zlib_init, result);
}
z->avail_in = 0;
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check necessary and did you account in these changes for concatenated gzip streams, I've read that is possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is not "necessary": it just allows detecting trailing bytes (after the stream), while the previous code silently ignored the trailing bytes in buffer and restartied a possibly desynced stream at the next call if some. See

/* Done? clean up, return */

Returning an error for unparsable trailing bytes is probably cleaner than silently discarding them.

did you account in these changes for concatenated gzip streams, I've read that is possible.

No, and I think this was never supported in curl: see the above comment. In any case, use of zlib < 1.2.0.4 cannot allow it if we don't process the trailer explicitly.

The link to Mark Adler's comment is about gzip files and this is primarily a mean to support gzip archives with several files. Strictly speaking, you're right about concatenation (see RFC 2616, 3.5). In practice, I don't know if it is also intended for HTTP streams, if some server can generate it or if some client supports it and how (cannot find something about it on internet). Functionally speaking, concatenation only makes sense if curl does not inflate (thus outputting a real gzip file).

Supporting concatenation could be a new feature. TBD since it implies a loss of structure data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an error for unparsable trailing bytes is probably cleaner than silently discarding them.

Probably but there's a lot of wackiness out there.

if(status == Z_OK || status == Z_STREAM_END) {
allow_restart = 0;
/* Flush output data if some. */
if(z->avail_out != DSIZ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find in the doc that it's safe to handle this without checking that status == Z_OK || status == Z_STREAM_END

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, I can't find that it is not safe. inflate() always maintains the in/out pointers/counters properly. If some data is inflated before an error is detected in the same call, it returns the error status but there are some data to flush in the output buffer. curl flushes as much as possible before returning the error. See https://github.com/madler/zlib//blob/master/inflate.c#L1248

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know I'm taking it upstream I'll cc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've received the question to upstream, thanks for cc. Did you have an upstream answer about it ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've received the question to upstream, thanks for cc. Did you have an upstream answer about it ?

No nobody replied

/cc @madler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I wrote a test program for it. See below.

This shows that, with inflate(Z_SYNC_FLUSH), although the pointers and counters are always kept in sync, there might be some trailing garbage generated in the output buffer before the error is detected. So i'll reintroduce the status test before writing.

Please note that this alone will not avoid garbage output in all cases: if garbage data is split across 2 output buffers, the error is not detected before the 2nd call to inflate(). The solution is to use Z_BLOCK instead of Z_SYNC_FLUSH: this will process one deflate block at a time and will reduce the "chances" to have output garbage data mixed with good data, maybe at the expense of a little more overhead. End of processing will be signalled by status Z_BUF_ERROR.

By playing with the size of decbuf (i.e.: setting it to 1024), you'll see a single inflate(Z_SYNC_FLUSH) call does generate correct data followed by garbage before detecting the error.

#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char		testdata[] = "Zlib compressed data";

int
main(int argc, char * * argv)
{
	z_stream z;
	size_t enclen;
	char encbuf[1024];
	char decbuf[24];
	int zlibrc;

	/* Build compressed corrupt data. */
	memset((char *) &z, 0, sizeof z);
	deflateInit2(&z, 9, Z_DEFLATED, -MAX_WBITS, 8, Z_DEFAULT_STRATEGY);
	z.next_out = (Bytef *) encbuf;
	z.avail_out = sizeof encbuf;
	z.next_in = (Bytef *) testdata;
	z.avail_in = strlen(testdata);
	deflate(&z, Z_SYNC_FLUSH);
	strncpy((char *) z.next_out, "zlib corrupt data", z.avail_out);
	enclen = sizeof encbuf - z.avail_out + strlen((char *) z.next_out);
	deflateEnd(&z);

	/* Reinflate. */
	inflateInit2(&z, -MAX_WBITS);
	z.next_in = (Bytef *) encbuf;
	z.avail_in = enclen;
	do {
		z.next_out = (Bytef *) decbuf;
		z.avail_out = sizeof decbuf;
		zlibrc = inflate(&z, Z_SYNC_FLUSH);
		printf("consumed: %u bytes, inflated: %u bytes, " \
		       "inflate return code = %d (%s)\n",
		       enclen - z.avail_in, sizeof decbuf - z.avail_out,
		       zlibrc, z.msg);
	} while (zlibrc == Z_OK);
	exit(0);
}

I'll push these changes as soon as my local "make check" terminates successfully.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not getting what the question is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not getting what the question is.

We were trying to figure out whether it was safe to use the data written by inflate if we just checked avail_out to see if it had changed regardless of checking inflate return value. In other words is what is written in next_out always valid. Let's say next_out is buf[100] and avail_out before is 100 but after is 30. Does that mean the 70 bytes written by inflate is always valid?

@monnerat has since wrote a test in the previous post that appears to show inflate z_ok even with corrupt input.

if garbage data is split across 2 output buffers, the error is not detected before the 2nd call to inflate()

I added the libcurl dump function to the test to visualize it:

--- ztest.c.orig	2017-11-25 14:49:54.732905692 -0500
+++ ztest.c	2017-11-25 14:54:04.076899105 -0500
@@ -3,6 +3,37 @@
 #include <stdlib.h>
 #include <string.h>
 
+void dump(const char *text, FILE *stream,
+          const unsigned char *ptr, unsigned long long size)
+{
+  unsigned long long i;
+  unsigned int c, width = 0x10;
+
+  fprintf(stream, "%s length is %llu bytes (0x%llx)\n",
+          text, size, size);
+
+  for(i = 0; i < size; i += width) {
+
+    fprintf(stream, "%8.8llx: ", i);
+
+    /* show hex to the left */
+    for(c = 0; c < width; c++) {
+      if(i+c < size)
+        fprintf(stream, "%02x ", ptr[i+c]);
+      else
+        fputs("   ", stream);
+    }
+
+    /* show data on the right */
+    for(c = 0; (c < width) && (i+c < size); c++) {
+      char x = (ptr[i+c] >= 0x20 && ptr[i+c] < 0x80) ? ptr[i+c] : '.';
+      fputc(x, stream);
+    }
+
+    fputc('\n', stream); /* newline */
+  }
+}
+
 char		testdata[] = "Zlib compressed data";
 
 int
@@ -34,10 +65,11 @@
 		z.next_out = (Bytef *) decbuf;
 		z.avail_out = sizeof decbuf;
 		zlibrc = inflate(&z, Z_SYNC_FLUSH);
-		printf("consumed: %u bytes, inflated: %u bytes, " \
+		printf("consumed: %zu bytes, inflated: %zu bytes, " \
 		       "inflate return code = %d (%s)\n",
 		       enclen - z.avail_in, sizeof decbuf - z.avail_out,
 		       zlibrc, z.msg);
+    dump("decbuf", stdout, (unsigned char *)decbuf, sizeof decbuf - z.avail_out);
 	} while (zlibrc == Z_OK);
 	exit(0);
 }
owner@ubuntu1604-x64-vm:~$ ./ztest
consumed: 32 bytes, inflated: 24 bytes, inflate return code = 0 ((null))
decbuf length is 24 bytes (0x18)
00000000: 5a 6c 69 62 20 63 6f 6d 70 72 65 73 73 65 64 20 Zlib compressed 
00000010: 64 61 74 61 e3 39 34 30                         data.940
consumed: 42 bytes, inflated: 8 bytes, inflate return code = -3 (invalid distance too far back)
decbuf length is 8 bytes (0x8)
00000000: 1c 3f 34 c9 ab 53 5b 51                         .?4..S[Q

.940 and after is the garbage. it seems to me if we're dealing with bad input we may not be able to detect all bad output on inflate?

@monnerat
Copy link
Contributor Author

There's a failing Travis job, but i'm pretty sure it is not related.
https://travis-ci.org/curl/curl/jobs/300982481

The zlib documentation states that when the output buffer has
been fully filled by inflate(), there still may be some output
data bytes latched in state: even if all input data have been
consumed, inflate() must be called again to obtain these bytes.
This reduces the "chances" to have corrupt data silently appended to
properly inflated data.

Also flush data only if status is Z_OK or Z_STREAM_END.
@monnerat
Copy link
Contributor Author

@jay : Review stalled ?
If you have no objection, il'll commit this soon, considering the following points.

  • zlib is a compression library with no data validity check: data errors are only signalled when zlib cannot handle what is intended as one of its own control values: we cannot act on it at the libcurl level. In addition, even non-compressed data can be corrupted without internal detection either. In other words, zlib returning Z_OK is not a guarantee of data integrity, but just an indication zlib was able to do its job.
  • The use of Z_BLOCK instead of Z_SYNC_FLUSH improves early detection of improper data, while preserving as much correct data as possible.
  • Test on Z_OK or Z_STREAM_END is back, thus preventing blocks with detected corrupt data to be written.
  • The initial PR goal is respected.
  • We still flush as much correct output data as possible.

Of course, there will still be cases where corrupt data remains undetected, i.e.: if non-zlib-control bytes are altered.

The overall changes in this PR improve zlib inflating, both for restarting in raw mode and at the data flushing/error detection level.

@jay
Copy link
Member

jay commented Dec 12, 2017

@monnerat That all sounds good. I have not done a final review yet but I will. Also I have a question about the different flags, from the zlib manual:

Z_SYNC_FLUSH requests that inflate() flush as much output as possible to the output buffer. Z_BLOCK requests that inflate() stop if and when it gets to the next deflate block boundary.

If we don't flush the data then could we end up in a situation where some huge block can't be decompressed (inflated)?

@monnerat
Copy link
Contributor Author

If we don't flush the data then could we end up in a situation where some huge block can't be decompressed (inflated)?

I made another test program, creating 100000 'A' characters and compressing them in one deflate(Z_FINISH) call. Then loop inflates back the data into a 1024-byte buffer, returning Z_OK each time the output buffer is full, as it was the case with Z_SYNC_FLUSH. I can imagine Z_BLOCK just forces return when a deflated block ends, whatever is the output byte count. Thus the case you described should not occur.
In addition, we still flush as much (correct) data as possible, since the loop is not anymore ended when the input byte count left is zero, but when inflate() returns Z_BUF_ERROR, meaning it was not able to consume nor produce any byte.

}
}
/* UNREACHED */
free(decomp);
if(nread && zp->zlib_init == ZLIB_INIT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're about to leave this call thus the nread-bytes won't be seen again. If we are in a state that would wrongly allow restart in raw mode at the next call, change state.

if(inflateInit2(z, -MAX_WBITS) != Z_OK) {
return process_zlib_error(conn, z);
}
zp->trailerlen = 8; /* Trailer length. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where did you get this magic number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http://www.zlib.org/rfc-gzip.html#file-format

A trailer is made of a CRC32 and a 32-bit input size.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I had actually looked at that before I asked and couldn't figure it out. I realize now the (more -->) is describing the structure and so trailer is after ...compressed blocks....

if(inflateInit2(z, -MAX_WBITS) != Z_OK) {
return process_zlib_error(conn, z);
}
zp->trailerlen = 8; /* Trailer length. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I had actually looked at that before I asked and couldn't figure it out. I realize now the (more -->) is describing the structure and so trailer is after ...compressed blocks....

@jay
Copy link
Member

jay commented Dec 14, 2017

re commits I think some of them could be squashed but i'll leave it to you. also please add Ref: or Closes etc

@monnerat
Copy link
Contributor Author

@jay : thanks for review.

@monnerat monnerat closed this in 4acc9d3 Dec 20, 2017
@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants