content_encoding: restart zlib in raw mode only if still possible. #2068

monnerat · 2017-11-11T12:47:31Z

If the initial data is not available anymore or if output has
already started, restarting in raw mode is not possible.

This commit also rephrase the inflate_stream() procedure and checks
for deflate/gzip unparsable trailing data bytes.

New test 232 checks the deflate algorithm in raw mode.

If the initial data is not available anymore or if output has already started, restarting in raw mode is not possible. This commit also rephrase the inflate_stream() procedure and checks for deflate/gzip unparsable trailing data bytes. New test 232 checks the deflate algorithm in raw mode.

jay · 2017-11-11T22:46:55Z

lib/content_encoding.c

+      break;
+    default:
+      result = exit_zlib(conn, z, &zp->zlib_init, CURLE_WRITE_ERROR);
+      continue;


I think this would be easier to understand:

if(zp->zlib_init != ZLIB_INIT && zp->zlib_init != ZLIB_INFLATING && zp->zlib_init != ZLIB_INIT_GZIP && zp->zlib_init != ZLIB_GZIP_INFLATING) { result = exit_zlib(conn, z, &zp->zlib_init, CURLE_WRITE_ERROR); break; }

Many thanks for the review.
Maybe you're right. At least we have the same effect.

jay · 2017-11-11T22:52:09Z

lib/content_encoding.c

+        result = exit_zlib(conn, z, &zp->zlib_init, result);
+      }
+      z->avail_in = 0;
+      break;


Is this check necessary and did you account in these changes for concatenated gzip streams, I've read that is possible.

This check is not "necessary": it just allows detecting trailing bytes (after the stream), while the previous code silently ignored the trailing bytes in buffer and restartied a possibly desynced stream at the next call if some. See

curl/lib/content_encoding.c

Line 128 in 462f3ca

/* Done? clean up, return */

Returning an error for unparsable trailing bytes is probably cleaner than silently discarding them.

did you account in these changes for concatenated gzip streams, I've read that is possible.

No, and I think this was never supported in curl: see the above comment. In any case, use of zlib < 1.2.0.4 cannot allow it if we don't process the trailer explicitly.

The link to Mark Adler's comment is about gzip files and this is primarily a mean to support gzip archives with several files. Strictly speaking, you're right about concatenation (see RFC 2616, 3.5). In practice, I don't know if it is also intended for HTTP streams, if some server can generate it or if some client supports it and how (cannot find something about it on internet). Functionally speaking, concatenation only makes sense if curl does not inflate (thus outputting a real gzip file).

Supporting concatenation could be a new feature. TBD since it implies a loss of structure data.

Returning an error for unparsable trailing bytes is probably cleaner than silently discarding them.

Probably but there's a lot of wackiness out there.

jay · 2017-11-12T06:37:35Z

lib/content_encoding.c

-    if(status == Z_OK || status == Z_STREAM_END) {
-      allow_restart = 0;
+    /* Flush output data if some. */
+    if(z->avail_out != DSIZ) {


I can't find in the doc that it's safe to handle this without checking that status == Z_OK || status == Z_STREAM_END

Alternatively, I can't find that it is not safe. inflate() always maintains the in/out pointers/counters properly. If some data is inflated before an error is detected in the same call, it returns the error status but there are some data to flush in the output buffer. curl flushes as much as possible before returning the error. See https://github.com/madler/zlib//blob/master/inflate.c#L1248

I don't know I'm taking it upstream I'll cc

I've received the question to upstream, thanks for cc. Did you have an upstream answer about it ?

I've received the question to upstream, thanks for cc. Did you have an upstream answer about it ?

No nobody replied

/cc @madler

Ok, I wrote a test program for it. See below.

This shows that, with inflate(Z_SYNC_FLUSH), although the pointers and counters are always kept in sync, there might be some trailing garbage generated in the output buffer before the error is detected. So i'll reintroduce the status test before writing.

Please note that this alone will not avoid garbage output in all cases: if garbage data is split across 2 output buffers, the error is not detected before the 2nd call to inflate(). The solution is to use Z_BLOCK instead of Z_SYNC_FLUSH: this will process one deflate block at a time and will reduce the "chances" to have output garbage data mixed with good data, maybe at the expense of a little more overhead. End of processing will be signalled by status Z_BUF_ERROR.

By playing with the size of decbuf (i.e.: setting it to 1024), you'll see a single inflate(Z_SYNC_FLUSH) call does generate correct data followed by garbage before detecting the error.

#include <zlib.h> #include <stdio.h> #include <stdlib.h> #include <string.h> char testdata[] = "Zlib compressed data"; int main(int argc, char * * argv) { z_stream z; size_t enclen; char encbuf[1024]; char decbuf[24]; int zlibrc; /* Build compressed corrupt data. */ memset((char *) &z, 0, sizeof z); deflateInit2(&z, 9, Z_DEFLATED, -MAX_WBITS, 8, Z_DEFAULT_STRATEGY); z.next_out = (Bytef *) encbuf; z.avail_out = sizeof encbuf; z.next_in = (Bytef *) testdata; z.avail_in = strlen(testdata); deflate(&z, Z_SYNC_FLUSH); strncpy((char *) z.next_out, "zlib corrupt data", z.avail_out); enclen = sizeof encbuf - z.avail_out + strlen((char *) z.next_out); deflateEnd(&z); /* Reinflate. */ inflateInit2(&z, -MAX_WBITS); z.next_in = (Bytef *) encbuf; z.avail_in = enclen; do { z.next_out = (Bytef *) decbuf; z.avail_out = sizeof decbuf; zlibrc = inflate(&z, Z_SYNC_FLUSH); printf("consumed: %u bytes, inflated: %u bytes, " \ "inflate return code = %d (%s)\n", enclen - z.avail_in, sizeof decbuf - z.avail_out, zlibrc, z.msg); } while (zlibrc == Z_OK); exit(0); }

I'll push these changes as soon as my local "make check" terminates successfully.

Sorry, I'm not getting what the question is.

Sorry, I'm not getting what the question is.

We were trying to figure out whether it was safe to use the data written by inflate if we just checked avail_out to see if it had changed regardless of checking inflate return value. In other words is what is written in next_out always valid. Let's say next_out is buf[100] and avail_out before is 100 but after is 30. Does that mean the 70 bytes written by inflate is always valid?

@monnerat has since wrote a test in the previous post that appears to show inflate z_ok even with corrupt input.

if garbage data is split across 2 output buffers, the error is not detected before the 2nd call to inflate()

I added the libcurl dump function to the test to visualize it:

--- ztest.c.orig 2017-11-25 14:49:54.732905692 -0500 +++ ztest.c 2017-11-25 14:54:04.076899105 -0500 @@ -3,6 +3,37 @@ #include <stdlib.h> #include <string.h> +void dump(const char *text, FILE *stream, + const unsigned char *ptr, unsigned long long size) +{ + unsigned long long i; + unsigned int c, width = 0x10; + + fprintf(stream, "%s length is %llu bytes (0x%llx)\n", + text, size, size); + + for(i = 0; i < size; i += width) { + + fprintf(stream, "%8.8llx: ", i); + + /* show hex to the left */ + for(c = 0; c < width; c++) { + if(i+c < size) + fprintf(stream, "%02x ", ptr[i+c]); + else + fputs(" ", stream); + } + + /* show data on the right */ + for(c = 0; (c < width) && (i+c < size); c++) { + char x = (ptr[i+c] >= 0x20 && ptr[i+c] < 0x80) ? ptr[i+c] : '.'; + fputc(x, stream); + } + + fputc('\n', stream); /* newline */ + } +} + char testdata[] = "Zlib compressed data"; int @@ -34,10 +65,11 @@ z.next_out = (Bytef *) decbuf; z.avail_out = sizeof decbuf; zlibrc = inflate(&z, Z_SYNC_FLUSH); - printf("consumed: %u bytes, inflated: %u bytes, " \ + printf("consumed: %zu bytes, inflated: %zu bytes, " \ "inflate return code = %d (%s)\n", enclen - z.avail_in, sizeof decbuf - z.avail_out, zlibrc, z.msg); + dump("decbuf", stdout, (unsigned char *)decbuf, sizeof decbuf - z.avail_out); } while (zlibrc == Z_OK); exit(0); }

owner@ubuntu1604-x64-vm:~$ ./ztest consumed: 32 bytes, inflated: 24 bytes, inflate return code = 0 ((null)) decbuf length is 24 bytes (0x18) 00000000: 5a 6c 69 62 20 63 6f 6d 70 72 65 73 73 65 64 20 Zlib compressed 00000010: 64 61 74 61 e3 39 34 30 data.940 consumed: 42 bytes, inflated: 8 bytes, inflate return code = -3 (invalid distance too far back) decbuf length is 8 bytes (0x8) 00000000: 1c 3f 34 c9 ab 53 5b 51 .?4..S[Q

.940 and after is the garbage. it seems to me if we're dealing with bad input we may not be able to detect all bad output on inflate?

monnerat · 2017-11-12T17:22:59Z

There's a failing Travis job, but i'm pretty sure it is not related.
https://travis-ci.org/curl/curl/jobs/300982481

The zlib documentation states that when the output buffer has been fully filled by inflate(), there still may be some output data bytes latched in state: even if all input data have been consumed, inflate() must be called again to obtain these bytes.

This reduces the "chances" to have corrupt data silently appended to properly inflated data. Also flush data only if status is Z_OK or Z_STREAM_END.

monnerat · 2017-12-11T14:34:38Z

@jay : Review stalled ?
If you have no objection, il'll commit this soon, considering the following points.

zlib is a compression library with no data validity check: data errors are only signalled when zlib cannot handle what is intended as one of its own control values: we cannot act on it at the libcurl level. In addition, even non-compressed data can be corrupted without internal detection either. In other words, zlib returning Z_OK is not a guarantee of data integrity, but just an indication zlib was able to do its job.
The use of Z_BLOCK instead of Z_SYNC_FLUSH improves early detection of improper data, while preserving as much correct data as possible.
Test on Z_OK or Z_STREAM_END is back, thus preventing blocks with detected corrupt data to be written.
The initial PR goal is respected.
We still flush as much correct output data as possible.

Of course, there will still be cases where corrupt data remains undetected, i.e.: if non-zlib-control bytes are altered.

The overall changes in this PR improve zlib inflating, both for restarting in raw mode and at the data flushing/error detection level.

jay · 2017-12-12T08:07:36Z

@monnerat That all sounds good. I have not done a final review yet but I will. Also I have a question about the different flags, from the zlib manual:

Z_SYNC_FLUSH requests that inflate() flush as much output as possible to the output buffer. Z_BLOCK requests that inflate() stop if and when it gets to the next deflate block boundary.

If we don't flush the data then could we end up in a situation where some huge block can't be decompressed (inflated)?

monnerat · 2017-12-12T16:28:50Z

If we don't flush the data then could we end up in a situation where some huge block can't be decompressed (inflated)?

I made another test program, creating 100000 'A' characters and compressing them in one deflate(Z_FINISH) call. Then loop inflates back the data into a 1024-byte buffer, returning Z_OK each time the output buffer is full, as it was the case with Z_SYNC_FLUSH. I can imagine Z_BLOCK just forces return when a deflated block ends, whatever is the output byte count. Thus the case you described should not occur.
In addition, we still flush as much (correct) data as possible, since the loop is not anymore ended when the input byte count left is zero, but when inflate() returns Z_BUF_ERROR, meaning it was not able to consume nor produce any byte.

jay · 2017-12-13T07:17:39Z

lib/content_encoding.c

    }
  }
-  /* UNREACHED */
+  free(decomp);
+  if(nread && zp->zlib_init == ZLIB_INIT)


what is this for

We're about to leave this call thus the nread-bytes won't be seen again. If we are in a state that would wrongly allow restart in raw mode at the next call, change state.

jay · 2017-12-13T07:20:19Z

lib/content_encoding.c

    if(inflateInit2(z, -MAX_WBITS) != Z_OK) {
      return process_zlib_error(conn, z);
    }
+    zp->trailerlen = 8;          /* Trailer length. */


where did you get this magic number

http://www.zlib.org/rfc-gzip.html#file-format

A trailer is made of a CRC32 and a 32-bit input size.

I see. I had actually looked at that before I asked and couldn't figure it out. I realize now the (more -->) is describing the structure and so trailer is after ...compressed blocks....

jay · 2017-12-14T08:20:06Z

lib/content_encoding.c

    if(inflateInit2(z, -MAX_WBITS) != Z_OK) {
      return process_zlib_error(conn, z);
    }
+    zp->trailerlen = 8;          /* Trailer length. */


I see. I had actually looked at that before I asked and couldn't figure it out. I realize now the (more -->) is describing the structure and so trailer is after ...compressed blocks....

jay · 2017-12-14T08:24:57Z

re commits I think some of them could be squashed but i'll leave it to you. also please add Ref: or Closes etc

monnerat · 2017-12-20T13:56:12Z

@jay : thanks for review.

jay reviewed Nov 11, 2017

View reviewed changes

jay reviewed Nov 12, 2017

View reviewed changes

monnerat added 3 commits November 12, 2017 12:32

content_encoding: clarify state check in inflate_stream()

19f33f3

content_encoding: comment about not using inflateReset2()

8858749

content_encoding: move trailing data bytes processing in a function

ca9d385

monnerat added 3 commits November 22, 2017 15:24

inflate_stream: rearrange Z_DATA_ERROR code sequence

2a8bc85

inflate_stream: process one deflate block at a time

7363315

This reduces the "chances" to have corrupt data silently appended to properly inflated data. Also flush data only if status is Z_OK or Z_STREAM_END.

jay reviewed Dec 13, 2017

View reviewed changes

jay approved these changes Dec 14, 2017

View reviewed changes

monnerat closed this in 4acc9d3 Dec 20, 2017

lock bot locked as resolved and limited conversation to collaborators May 9, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

content_encoding: restart zlib in raw mode only if still possible. #2068

content_encoding: restart zlib in raw mode only if still possible. #2068

monnerat commented Nov 11, 2017

jay Nov 11, 2017

monnerat Nov 12, 2017

jay Nov 11, 2017

monnerat Nov 12, 2017

jay Nov 12, 2017

jay Nov 12, 2017

monnerat Nov 12, 2017

jay Nov 12, 2017

monnerat Nov 22, 2017

jay Nov 22, 2017

monnerat Nov 23, 2017

madler Nov 23, 2017

jay Nov 25, 2017

monnerat commented Nov 12, 2017

monnerat commented Dec 11, 2017

jay commented Dec 12, 2017 •

edited

monnerat commented Dec 12, 2017

jay Dec 13, 2017

monnerat Dec 13, 2017

jay Dec 13, 2017

monnerat Dec 13, 2017

jay Dec 14, 2017

jay Dec 14, 2017

jay commented Dec 14, 2017

monnerat commented Dec 20, 2017

content_encoding: restart zlib in raw mode only if still possible. #2068

content_encoding: restart zlib in raw mode only if still possible. #2068

Conversation

monnerat commented Nov 11, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

monnerat commented Nov 12, 2017

monnerat commented Dec 11, 2017

jay commented Dec 12, 2017 • edited

monnerat commented Dec 12, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jay commented Dec 14, 2017

monnerat commented Dec 20, 2017

jay commented Dec 12, 2017 •

edited