curl-library

Write callback function when following HTTP redirections

From: Nicolas Roeser via curl-library <curl-library_at_cool.haxx.se>
Date: Mon, 15 Apr 2019 17:34:37 +0200

Hello again!

I am still writing my code in PHP, but looking at the code of the PHP
curl extension, I have found my question to be a general libcurl
question; this is why I am writing to this list.

My code uses a write callback function, a header callback function, and
a progress callback function. The latter may cause a download to be
canceled (see my earlier question in thread
<mid:20190407213813.GD11240_at_imap.uni-ulm.de>). I have enabled
CURLOPT_FOLLOWLOCATION and CURLOPT_HEADER. (The code also enables
CURLOPT_RETURNTRANSFER, but that is specific to the curl extension of
PHP, and merely causes the internal buffer to be output if there is no
error.) The write callback function appends the received data to a
dynamic buffer.
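
For illustration, the write callback described above could look roughly
like the following C sketch (my real code is PHP, so this is only an
equivalent). The `recv_buf` struct and the `append_cb` name are
hypothetical; only the function signature is the one documented for
CURLOPT_WRITEFUNCTION.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable receive buffer; name and layout are illustrative. */
struct recv_buf {
    char  *data;  /* NUL-terminated copy of everything received so far */
    size_t len;   /* number of bytes stored, not counting the NUL */
};

/* Same signature that libcurl documents for CURLOPT_WRITEFUNCTION. */
static size_t append_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct recv_buf *buf = userdata;
    size_t n = size * nmemb;
    char *grown;

    assert(buf != NULL);

    /* Grow the buffer by n bytes plus room for a terminating NUL. */
    grown = realloc(buf->data, buf->len + n + 1);
    if (grown == NULL)
        return 0;  /* a count different from n tells libcurl to abort */

    memcpy(grown + buf->len, ptr, n);
    buf->data = grown;
    buf->len += n;
    buf->data[buf->len] = '\0';
    return n;
}
```

With libcurl itself, such a callback would be registered through
curl_easy_setopt() using CURLOPT_WRITEFUNCTION, with the buffer passed
via CURLOPT_WRITEDATA.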

I would like to parse the received data (or the first part of it) even
if the download has been aborted.

My problem is that I do not know where the boundary between header and
body is if the download has been aborted. To make things worse, I have
the feeling that it may be difficult to properly detect.

With the prerequisites listed above, consider the following scenario:
1. client sends an HTTP GET request to the server,
2. server responds with 3xx, Location header field, no Content-Length,
   and a body with chunked transfer coding,
3. client reads the chunked body and then follows the redirection,
4. server responds with 200 and sends a huge document (which _might_
   contain parts that look like message/http content 😉),
5. client starts reading the resource, but aborts after a certain
   number of bytes.

I would like to clear the receive buffer each time the client starts
reading a new resource. But I am not sure when this can safely be done.
From the man pages for CURLOPT_WRITEFUNCTION and
CURLOPT_HEADERFUNCTION, I can see that while the header callback
function is called once per header line (to simplify their handling),
the write callback function may be called with big blocks of data. So I
assume that it is _not_ safe to clear the receive buffer as soon as I
see an HTTP status-line.
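
One idea I am considering can be sketched in C as follows. It rests on
two assumptions that I have not verified: that with CURLOPT_HEADER
disabled the write callback receives only body bytes, and that libcurl
passes each response's status-line to the header callback before any
body data of that response reaches the write callback. All names
(`body_buf`, `header_cb`, `body_cb`, `BODY_CAP`) are hypothetical, and
the fixed-size buffer is only there to keep the sketch short.

```c
#include <assert.h>
#include <string.h>

#define BODY_CAP 4096  /* arbitrary capacity chosen for this sketch */

/* Hypothetical buffer holding the body of the most recent response. */
struct body_buf {
    char   data[BODY_CAP];
    size_t len;
};

/* Signature as documented for CURLOPT_HEADERFUNCTION: one header line
 * per call, not NUL-terminated. */
static size_t header_cb(char *buffer, size_t size, size_t nitems,
                        void *userdata)
{
    struct body_buf *body = userdata;
    size_t n = size * nitems;

    assert(body != NULL);

    /* A line starting with "HTTP/" is taken as the status-line of a new
     * response, so whatever body was collected before is discarded. */
    if (n >= 5 && memcmp(buffer, "HTTP/", 5) == 0) {
        body->len = 0;
        body->data[0] = '\0';
    }
    return n;
}

/* Signature as documented for CURLOPT_WRITEFUNCTION. Assumed to see
 * body bytes only, so a body that merely looks like HTTP is harmless. */
static size_t body_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct body_buf *body = userdata;
    size_t n = size * nmemb;
    size_t room = BODY_CAP - 1 - body->len;
    size_t keep = n < room ? n : room;  /* silently truncate when full */

    memcpy(body->data + body->len, ptr, keep);
    body->len += keep;
    body->data[body->len] = '\0';
    return n;  /* report all bytes as handled so the transfer goes on */
}
```

If the assumptions hold, the buffer would contain only the (possibly
partial) body of the last response after an aborted transfer, because
the status-line check runs in the header callback and never inspects
body data.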

Guessing the start of the document is not an option of course.

I first thought that I might disable CURLOPT_HEADER and handle some
headers differently from what is done now. But this seems not to help
with my problem of identifying when to clear my receive buffer as long
as CURLOPT_FOLLOWLOCATION is on.

Now I am a bit lost, and assume that I am missing something here. This
is why I would like to ask for help:

How can I extract the body of my target resource which has been
partially received? Are the man pages or my interpretation of them too
strict? Do I need to switch to a completely different approach?

I have a feeling that the write callback function will never be called
with data from two HTTP responses at once (that is, a single call will
never span a redirection). Is my guess correct? If so, is this
guaranteed, and will it stay that way?

Cheers

-- 
Nico
Nicolas Roeser
kiz – Information Systems Department, Ulm University
Received on 2019-04-15