> >That is not the case.
> >We know there is data available to read (but we cannot know how much).
> >We must already have a known buffer area to store data in, then call
> >recv() to get the data and then store it in the buffer (possibly after
> >some "decoding" like SSL, chunked HTTP and similar).
> >Therefore, we must already have been given a buffer pointer (and size)
> >from the application layer where we can store the received data BEFORE
> >we call the write callback.
I have thought about zero-copy for many years, so I'll add my 2p:
All this talk of storing data to files, further processing in memory,
etc., assumes that the incoming data comes from a network socket, is
not filtered, and is not surrounded by protocol of unpredictable
length (such as HTTP headers). That is not true for some protocols.
If you want sinks and sources to handle: files, network streams,
further processing in memory, SSL, compression, HTTP chunk
encoding, etc., and you want them all to use the minimum number of
copies, then you _cannot_ use a fixed pattern such as "library
calls application to request buffer; library stores data into it".
That's the pattern Legolas suggests, and it works fine when data is
simply being read from a socket.
But, for example, if the library needs to decompress the data (zlib)
or decrypt it (SSL), then that interface actually causes more copies
than necessary. For example, the zlib decompression algorithm
_requires_ all the incoming data to be stored in a 32k circular buffer
- that's part of the algorithm.
Legolas' pattern requires each decompressed data chunk to be then
_copied_, from the algorithm's 32k buffer, to the application's
supplied buffer. That is more copies than necessary.
If the library supplies the buffer, then in the case of zlib
decompression, (and chunked encoding, etc.) it's possible to use fewer
copies. (I'm not saying the libraries we have available make this
practical, btw. - This discussion is more about an optimal API than
about what's practical to implement).
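
To illustrate what "the library supplies the buffer" could look like when
the data already sits in a circular window, here is a minimal sketch. It is
not zlib's API; window_segments() and struct segment are names I made up
for this example. It describes a run of bytes in a 32k ring as one or two
contiguous views, so nothing has to be copied out of the window.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW_SIZE 32768u  /* size of the circular window, as in the text */

/* One contiguous, read-only view into the window (hypothetical type). */
struct segment {
    const unsigned char *data;
    size_t len;
};

/*
 * Describe `len` bytes starting at offset `start` of a circular window
 * as one or two contiguous segments, without copying anything.
 * Returns the number of segments written to `out` (0, 1 or 2).
 */
static int window_segments(const unsigned char *window, size_t start,
                           size_t len, struct segment out[2])
{
    assert(start < WINDOW_SIZE && len <= WINDOW_SIZE);
    if (len == 0)
        return 0;

    size_t first = WINDOW_SIZE - start;   /* bytes before the wrap point */
    if (first >= len) {
        out[0].data = window + start;     /* no wrap: a single segment */
        out[0].len = len;
        return 1;
    }
    out[0].data = window + start;         /* tail end of the ring */
    out[0].len = first;
    out[1].data = window;                 /* remainder wraps to the front */
    out[1].len = len - first;
    return 2;
}
```

The application would then be handed those one or two segments directly,
instead of a copy in a buffer of its own choosing.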
That's just an example. In _general_, a zero-copy interface should
look broadly like this:
a. The application has an "allocate_write_buffer" function, but it is
_optional_ for the library to use it.
b. The library will call the "allocate_write_buffer" function
_only_ when it does not already have the data in a fixed
location due to algorithms such as chunked decoding, zlib and
SSL decryption. So, for example, a direct recv() from the
socket, after reading headers, or in the middle of a large chunk
of chunked encoding, would let the application select the
buffer. It would do the same if it's using a
decompression/decryption library where that library's API forces
a copy anyway, to avoid a second copy.
c. The application's "write_callback" function must accept any
combination of buffers that were allocated by the application and
buffers provided by the library.
d. The library should be able to specify whether any buffer that
_it_ provides to "write_callback" is writable in place by
"write_callback" or not. This is needed for minimal copying by
sinks that further filter the data.
e. When data is available only in non-contiguous memory regions,
the data must not be copied to make it contiguous. Instead,
"write_callback" should be called more than once, or it should
accept a list of buffers.
f. The library and application should be able to negotiate how long
they can retain library-allocated buffers after the callback
returns. This is so that the write callback can gather multiple
buffers without copying the data, if (for example) the
application is intending to write the data using sendmsg() in
chunks of a certain minimum size, or otherwise process more data
than is available in one contiguous memory region from the library.
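
The list above could be sketched in C roughly as follows. This is only one
possible shape, under my own assumptions: every type, flag and function
name here (zc_buf, BUF_APP_OWNED, BUF_WRITABLE, BUF_RETAINED,
tally_callback) is hypothetical and not part of any existing library. Only
"allocate_write_buffer" and "write_callback" come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; none of this is an existing API. */

enum buf_flags {
    BUF_APP_OWNED = 1 << 0,  /* storage came from allocate_write_buffer */
    BUF_WRITABLE  = 1 << 1,  /* callback may modify the bytes in place */
    BUF_RETAINED  = 1 << 2   /* set by the callback: buffer is kept for later */
};

struct zc_buf {
    unsigned char *data;
    size_t len;
    unsigned flags;
};

/* Optional hook: the library may ask the application for storage, but it
 * is free never to call this when the data already lives somewhere. */
typedef unsigned char *(*allocate_write_buffer_fn)(void *ctx, size_t want,
                                                   size_t *got);

/* Receives a list of buffers, so non-contiguous data needs no joining.
 * The list may mix application-owned and library-owned buffers. */
typedef int (*write_callback_fn)(void *ctx, struct zc_buf *bufs,
                                 size_t nbufs);

/* Example sink: counts bytes and notes how many came from the library. */
struct tally {
    size_t total;
    size_t from_library;
};

static int tally_callback(void *ctx, struct zc_buf *bufs, size_t nbufs)
{
    struct tally *t = ctx;
    assert(t != NULL && (bufs != NULL || nbufs == 0));
    for (size_t i = 0; i < nbufs; i++) {
        t->total += bufs[i].len;
        if (!(bufs[i].flags & BUF_APP_OWNED))
            t->from_library += bufs[i].len;
    }
    return 0;  /* 0 = all data consumed */
}
```

The retention negotiation is reduced to a single flag here; a real design
would also need a way for the application to release retained buffers.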
> I understand, I'm sorry but I thought you were using something similar to
> ioctlsocket(yoursocket, FIONREAD, &available_data_size);
> To determine it, but probably this interface is not available on all
> socket layers.
That's correct. It is not always available.
But more importantly: even that does not provide zero-copy in general.
Think about this:
In one call to recv(), the library reads HTTP headers, plus the
first 1000 bytes of the data. The library _cannot_ know how long
the headers are, until it parses them. So it will always read some
of the data in the recv().
If you're serious about avoiding copies, that 1000 bytes of the
data would not be copied. But your interface forces those bytes to be copied.
FIONREAD doesn't change this.
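
As a small sketch of the headers-plus-data case: once the end of the
headers is found, the body bytes that arrived in the same recv() can be
passed on as a pointer into the receive buffer itself. body_in_place() is a
hypothetical helper written for this example, and it assumes the headers
end with the usual blank line (CRLF CRLF).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Find the start of the body in a buffer that holds HTTP headers followed
 * by some body bytes. Returns a pointer into `buf` itself (no copy) and
 * stores the number of body bytes in *body_len. Returns NULL if the
 * blank line that ends the headers has not arrived yet.
 */
static const char *body_in_place(const char *buf, size_t len,
                                 size_t *body_len)
{
    static const char end[] = "\r\n\r\n";
    assert(buf != NULL && body_len != NULL);
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, end, 4) == 0) {
            *body_len = len - (i + 4);
            return buf + i + 4;   /* points inside the receive buffer */
        }
    }
    return NULL;
}
```

The returned pointer is exactly the kind of library-provided buffer that a
write callback would have to accept if those bytes are not to be copied.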
Received on 2005-12-05