feature proposal: flag for mirroring an HTTP resource

From: Jannis R via curl-users <curl-users_at_lists.haxx.se>
Date: Sat, 7 Jan 2023 15:40:29 +0100

Hey,

Thank you, Daniel and all others, for building this awesome tool!

For my proposal, consider a use case that I'll call "mirroring" or "syncing" here:

- I want a local file to reflect an HTTP resource, whenever I have run the mirroring command. Without changes on the remote side, mirroring should effectively be an idempotent operation.
- I want this mirroring process to work as time- and bandwidth-efficiently as possible: It should not redownload any bytes that it has already downloaded, as long as the server provides the means for this.
- If the server *does not* provide the means for efficient mirroring, I want to be sure to have a correct & up-to-date copy, at the expense of time & bandwidth.

In particular, I'm talking about compressible files which are also served un-compressed to provide a low entry barrier. A common example would be large text files (e.g. CSV datasets) served using a standard web server (e.g. nginx), which either has access to pre-compressed variants of them or does on-the-fly compression.

AFAICT, implementing this behavior with the curl CLI is currently very hard and not possible with a single call, because there are a few edge cases that the curl CLI doesn't provide config flags for.

## problems

Let me explain some prerequisites first:

Because the HTTP RFCs define [Content-Encoding] (CE) as being a property of the entity, [Range requests] *do not* "make sense" on CE-coded files. Therefore continuing an interrupted download is only possible with a *non-CE-coded* representation of the resource. [Transfer-Encoding] would cleanly solve this problem, but unfortunately it is not widely supported in web servers and has no equivalent in HTTP/2 and HTTP/3 (yet?).

Also, because a CE-coded entity has a different [ETag] than its un-CE-coded equivalent, we *cannot* re-use the CE-coded ETag when continuing the download from the un-CE-coded entity to make sure we're still downloading the same "version" of the resource!

more details:
- https://github.com/golang/go/issues/30829#issuecomment-476694405
- https://github.com/httpwg/http2-spec/issues/445

Thus, we can only use CE-coding when downloading in one go (and start over after an interruption), and support continuation *for non-CE-coded entities only*.
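
To see the ETag mismatch for a given server, one could compare the headers of an identity request with those of a gzip request. The following is only a minimal sketch: `extract_etag` and `same_etag` are hypothetical helper names, and the commented-out curl calls assume a URL in `$url`.

```shell
# Print the ETag value from an HTTP header dump on stdin
# (e.g. the output of `curl -sI`, or a file written via `curl -D`).
extract_etag() {
  tr -d '\r' | awk 'tolower($0) ~ /^etag:/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}

# Hypothetical helper: succeed only if both header dumps carry the same, non-empty ETag.
same_etag() {
  a=$(printf '%s\n' "$1" | extract_etag)
  b=$(printf '%s\n' "$2" | extract_etag)
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Possible usage (needs network access, therefore commented out):
#   plain=$(curl -sI -H 'Accept-Encoding: identity' "$url")
#   coded=$(curl -sI -H 'Accept-Encoding: gzip' "$url")
#   same_etag "$plain" "$coded" || echo "ETags differ between representations"
```

The comparison is a plain string comparison, so a weak validator such as `W/"abc"` counts as different from `"abc"`.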

## workaround

A workaround would be to use curl's --raw flag to download the "opaque" maybe-CE-coded-maybe-not entity with continuation (`-C -`) support, and then manually decode it if the server responded with a Content-Encoding header. As of curl 7.79.1, this doesn't work because
- curl doesn't provide a mechanism to use --etag-compare (to avoid creating an invalid copy if the remote file has changed) with an unfinished/unstarted download;
- when using the [If-Range header] manually instead of --etag-compare, curl *does* continue upon a 206 (If-Range matched, server responds with [Content-Range]), but *doesn't* overwrite the file on a 200 (If-Range didn't match, server responds with full body).
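
The "manually decode it" step of this workaround could look roughly like the sketch below. It assumes the response headers were saved to a file (e.g. with `-D`) and only handles gzip; `content_encoding` and `decode_raw` are hypothetical names, not part of curl.

```shell
# Print the lower-cased Content-Encoding from a header dump on stdin; empty if absent.
# Only the first matching header is used, so a dump containing several
# responses (e.g. redirects) would need extra care.
content_encoding() {
  tr -d '\r' | awk 'tolower($0) ~ /^content-encoding:/ {
    sub(/^[^:]*:[ \t]*/, ""); print tolower($0); exit }'
}

# decode_raw HEADERS_FILE RAW_FILE OUT_FILE
# Copy the raw download as-is, or gunzip it if the server CE-coded it.
decode_raw() {
  enc=$(content_encoding < "$1")
  case "$enc" in
    '' | identity) cp -- "$2" "$3" ;;
    gzip | x-gzip) gzip -dc < "$2" > "$3" ;;
    *) echo "unsupported Content-Encoding: $enc" >&2; return 1 ;;
  esac
}
```

After a completed `--raw` download into, say, `data.csv.raw` with headers in `headers.txt`, `decode_raw headers.txt data.csv.raw data.csv` would produce the final file.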

This may seem like a niche use case, but it actually prevents curl from downloading a large compressible file in the most traffic- & time-efficient manner!

## proposed solution

Therefore, I would like to propose a new flag (or a set of flags, not sure how to split the described functionality into multiple orthogonal flags) that configures curl to
- use If-Range instead of [If-None-Match] to download an entity with continuation (sort of a mixture between `-C -` and --etag-{compare,save});
- still store the previous ETag and compare it with the current one, like with --etag-{compare,save};
- fail if the server responds with Content-Encoding (because then a) byte ranges wouldn't match and b) the ETag is different), *except* if --raw is used in addition;
- overwrite the partially downloaded local file if the server responds with 200 (If-Range didn't match);
- either implicitly enable `-z -` (Last-Modified comparison), or at least be compatible with it.
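
The decision logic in the list above can be sketched as a small shell function. This is just an illustration of the intended behaviour, not how curl would implement it: `mirror_action` is a hypothetical name, and treating 304 as "keep the local file" is my assumption for the `-z` case.

```shell
# mirror_action STATUS CONTENT_ENCODING [RAW]
# Prints what to do with the local partial file: append, overwrite, keep or fail.
mirror_action() {
  status=$1 encoding=$2 raw=${3:-0}
  case "$status" in
    304) echo keep; return 0 ;;   # assumption: not modified, nothing to download
    200 | 206) ;;                 # handled below
    *) echo fail; return 0 ;;
  esac
  # A CE-coded body is only acceptable when the raw bytes are mirrored.
  case "$encoding" in
    '' | identity) ;;
    *) [ "$raw" = 1 ] || { echo fail; return 0; } ;;
  esac
  # 206: If-Range matched, continue; 200: it didn't, start over.
  [ "$status" = 206 ] && echo append || echo overwrite
}
```

For example, `mirror_action 200 gzip 1` prints `overwrite`, while `mirror_action 206 gzip` prints `fail`.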

With the proposed --mirror-tmp-file flag (I'm very open to a better name!) and a "cache"/temp file path, mirroring a file could look as follows:

```
curl -f -o data.csv --mirror-tmp-file data.csv.gz 'http://example.org/data.csv'
```

What do you think?

## demo implementation

I have replicated this behaviour by wrapping curl into [a script that parses its output and acts upon it](https://gist.github.com/derhuerst/745cf09fe5f3ea2569948dd215bbfe1a). By reading it (~200 lines), you should be able to deduce the necessary changes the --mirror-tmp-file flag would have to implement. The attached readme also contains instructions on how to set up Caddy or nginx in order to test this scenario.

A cleaner implementation would probably use libcurl, but I'd argue that – even though wget exists – downloading a file using the curl CLI is common enough that there should be a way to do it "properly".

– Jannis

[Content-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding)
[Range requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests)
[Transfer-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding)
[ETag](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag)
[If-Range header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Range)
[Content-Range](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Range)
[If-None-Match](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match)

Received on 2023-01-07