Re: encoding expectations
Date: Fri, 20 Apr 2018 08:57:12 +0200 (CEST)
On Thu, 19 Apr 2018, Michael Kilburn wrote:
> - string is a sequence of bytes and until you need to interpret it's content
> (for example split out server name out of url) encoding is unimportant
For curl's "hybrid" URLs, and for RFC 3986 URLs, yes. Not for WHATWG URLs.
> - if HTTP protocol defines headers as US-ASCII-encoded string -- libcurl has
> to convert current encoding to US-ASCII
I disagree and we also can't do that without breaking compatibility.
You tell curl what to send and it sends it. If you want to send funny things
in broken encodings, go ahead and curl will send it for you. That's been a
design principle from day one. It makes curl excellent for testing/trying out
your server software as well. It also doesn't prevent your application from
> - if DNS protocol expects bytes of US-ASCII-encoded strings on the wire --
> libcurl has to do the conversion
That is done for the host name part now if curl is built with IDN support.
> - same wrt returning values back to user -- on-wire data has to be
> converted into current encoding
That doesn't happen. curl delivers *exactly* what was sent over the wire and
it doesn't know about charsets and encodings of data. If that needs to be
decoded in any way to be understood by a user, it needs to be done by the
> - this means if given url can't be represented in current encoding --
> libcurl can't be used to access given resource
Again this "url". What url can't be represented in plain ascii?
> - ...therefore it makes sense to introduce encoding-specific API (e.g.
> CURLOPT_URL_UTF8) which explicitly sets encoding of a string to be passed
I could possibly be convinced that this is a good idea, yes. But I also dread
dragging in a world of encoding/decoding problems into libcurl.
> - also, if libcurl takes string and simply forwards it to server without
> conversion (e.g. CURLOPT_HTTPHEADER) -- it has to be documented in big red
> letters, because current encoding is could be incompatible with what server
The *entire header* you pass on to the server may of course be "incompatible"
and so is the encoding of the content in that header. The entire idea with an
option such as CURLOPT_HTTPHEADER is to let the application decide what to
pass on to the server - because the application knows best what to send.
> once we agree on encoding of strings being passed (single-byte "current"
> encoding or UTF8)
curl is firmly in the "no encoding at all, just a collection of bytes" camp
for most things.
> time to decide what to do with content. I.e. how to handle URLs.
> - split out servername (slash and dot are in BECS and encoding is
> single-byte(or utf8) -- so it is guaranteed to work)
> - convert to Unicode, then apply ToASCII() and feed it to name resolution
> mechanism (e.g. gethostbyname/etc)
Then I figure the continued reading would include the finer points of
IDNA2003, IDNA2008, punycode etc. And gethostbyname was superseded by
getaddrinfo some ten years ago.
> - if not built with WinIDN/libidn -- fail if servername can't be converted
> to US-ASCII
(libidn isn't used anymore, we use libidn2 since a while back.)
We could possibly error out if non-ascii symbols are used in the name without
an IDN library, but that's also not how ping or telnet and other tools work
without IDN support and again not really in the spirit of curl: it uses what
you pass it. Even when you pass it crap, it tries to make use of it.
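
For illustration only: an application that does want the stricter behavior
can check the host name itself before handing the URL to curl. This is a
minimal sketch, assuming an ASCII-based platform (so not EBCDIC) and a host
name that has already been split out of the URL. host_has_non_ascii() is a
made-up helper name, not a libcurl function.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libcurl.
   Returns 1 if any byte of the NUL-terminated host name has its high bit
   set, meaning it falls outside 7-bit ASCII. Returns 0 otherwise. */
int host_has_non_ascii(const char *host)
{
  const unsigned char *p = (const unsigned char *)host;
  for(; *p; p++) {
    if(*p & 0x80)
      return 1;
  }
  return 0;
}
```

An application could refuse the transfer when this returns 1 and
curl_version_info() reports no IDN support, while curl itself keeps passing
on whatever it is given.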
There's nothing that prevents us from passing on unicode strings to
getaddrinfo() even when curl doesn't know about IDN. Like the curl build
Apple ships on macOS which still (*because of that*) can do "curl
> ... or my windows libcurl is built without IDN support. Is there any way to
> check it?
Yes: curl_version_info and the CURL_VERSION_IDN bit. curl -V also displays it.
> In any case, I still need an answer to one of my original questions -- in
> what encoding libcurl expects CURLOPT_URL to be? Is it "current" encoding
> as specified by current locale?
For IDN, the host name is expected to use your current locale. (I'm not sure
for winidn.) The rest of the URL is assumed to be "raw" single bytes where
each byte is a letter and everything else is suitably URL encoded with percent
encoding %20-style. (I should avoid using the word ASCII here since it works
for EBCDIC as well.)
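
To make the "%20-style" part concrete, here is a rough sketch of
percent-encoding a byte string before it goes into the non-host part of a
URL. It assumes an ASCII-based platform and treats only the RFC 3986
unreserved characters as safe. percent_encode() is a made-up name for this
example; libcurl ships curl_easy_escape() for the same job and that is
probably what you want in real code.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libcurl.
   Percent-encodes every byte of the NUL-terminated string `in` that is not
   an RFC 3986 unreserved character (A-Z a-z 0-9 - . _ ~) and writes the
   NUL-terminated result to `out`.
   Returns 0 on success, -1 if `out` (of `outsize` bytes) is too small. */
int percent_encode(const char *in, char *out, size_t outsize)
{
  static const char hex[] = "0123456789ABCDEF";
  const unsigned char *p = (const unsigned char *)in;
  size_t o = 0;

  if(outsize == 0)
    return -1;

  for(; *p; p++) {
    unsigned char c = *p;
    int unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                     (c >= '0' && c <= '9') ||
                     c == '-' || c == '.' || c == '_' || c == '~';
    size_t need = unreserved ? 1 : 3;

    if(o + need >= outsize)
      return -1; /* no room for this byte plus the terminating NUL */

    if(unreserved)
      out[o++] = (char)c;
    else {
      out[o++] = '%';
      out[o++] = hex[c >> 4];
      out[o++] = hex[c & 0x0F];
    }
  }
  out[o] = '\0';
  return 0;
}
```

The function works on bytes and never asks what charset they came from,
which matches the "just a collection of bytes" view: a UTF-8 encoded
character simply comes out as one %XX triplet per byte.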
I think a bigger issue for us with regard to IDN right now is to handle IDNA
2003 vs IDNA 2008 to behave closer to what browsers do. As discussed in this
old (closed but not resolved) issue: https://github.com/curl/curl/issues/1441
-- / daniel.haxx.se
Received on 2018-04-20