curl-library

Re: encoding expectations

From: Michael Kilburn <crusader.mike_at_gmail.com>
Date: Wed, 25 Apr 2018 11:56:02 -0500

 On Tue, 24 Apr 2018, Daniel Stenberg wrote:

> I would have said unconditionally yes, but then your comment about winidn made
> me look that up and yes that seems to require that the host name is provided
> as UTF-8 for it to work! I find that a little odd, but I can live with it.

Probably because on Windows the "current" encoding (i.e. ACP/OEMCP) can never
be UTF-8 (or any other encoding that can represent the entire Unicode
character set). Therefore, regardless of how you compile/configure your app,
there will be URLs you can't use. On Linux this isn't a problem, as UTF-8 is
pretty much a given.
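
To illustrate the code page point, here is a minimal Python sketch. It
assumes cp1252 and cp1251 as stand-ins for typical Windows ANSI code pages,
and the host name is made up for the example. It only shows that a legacy
code page covers a subset of Unicode, so some host names cannot be expressed
in the "current" encoding at all.

```python
# Hypothetical host name, used only for illustration.
host = "пример.example"

def encodable(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

print(encodable(host, "cp1252"))  # Western European code page
print(encodable(host, "cp1251"))  # Cyrillic code page
print(encodable(host, "utf-8"))   # covers all of Unicode
```

Whichever single-byte code page is active, some other script will fall
outside it, which is the situation described above.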

In any case, it would be really helpful to mention all this on the
CURLOPT_URL help page.

> But yes, if you want to be truly sure that there's a specific build
> combination for you to use, you better build your own.

I wouldn't mind dealing with non-IDN builds of libcurl, since there is a
workaround for the user: use punycode'd host names in URLs explicitly. But
there could be a problem with SSL certificate validation. If the certificate
is issued for the IDN (and not its punycode version), this workaround will
fail.
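
The workaround could look roughly like the sketch below. It is written in
Python rather than against libcurl, and it assumes Python's built-in "idna"
codec is acceptable for the conversion (that codec implements the older
IDNA 2003 rules, so results may differ from what an IDN-enabled libcurl
build produces for some names). The helper name to_ascii_host is made up
for the example.

```python
def to_ascii_host(host):
    """Convert each label of an IDN host name to its ASCII (xn--) form.

    Uses Python's built-in "idna" codec (IDNA 2003); this is an assumption
    made for the sketch, not a statement about what libcurl does.
    """
    return host.encode("idna").decode("ascii")

ascii_host = to_ascii_host("bücher.example")
url = "https://" + ascii_host + "/"
print(url)
```

The resulting URL contains only ASCII, so it can be passed to a non-IDN
build unchanged. Whether certificate validation then succeeds depends on
which form of the name the certificate carries.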

Googling related topics shows that people suggest putting punycode into
certificates. This would "save" the workaround, but it raises two
questions:
- is it a standard or just a suggestion?
- does libcurl convert the IDN to punycode before calling the underlying SSL
library to verify the certificate?

>> How does it work for EBCDIC without converting it to ASCII? these parts of
>> url should end up on HTTP header somewhere and HTTP standard requires
>> header to be US-ASCII encoded (afaik).
>
> EBCDIC systems do a whole lot of converting back and forth to/from ascii for
> the various protocols.

This can't be done at the socket send/receive level: there is no way to
figure out which bytes need to be re-encoded and which don't. Are you sure
libcurl doesn't do any "current" -> ASCII conversions and still works on
mainframes etc.?
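
A small sketch of why a blanket conversion at the socket layer would not
work. It is a Python simulation, assuming cp037 as the EBCDIC code page and
an arbitrary three-byte binary body. Nothing here reflects libcurl's actual
code.

```python
# A protocol message as an EBCDIC host might assemble it if nothing was
# converted earlier: header text in the native code page, then a binary body.
header_text = "GET / HTTP/1.1\r\n\r\n"
binary_body = bytes([0x00, 0xFF, 0x61])  # arbitrary payload bytes, not text

outgoing = header_text.encode("cp037") + binary_body

def convert_whole_stream(data):
    """Blindly transcode every byte from cp037 to Latin-1 (an ASCII superset)."""
    return data.decode("cp037").encode("latin-1")

converted = convert_whole_stream(outgoing)
header_len = len(header_text)

print(converted[:header_len])                 # header text is now ASCII
print(converted[header_len:] == binary_body)  # the body has been altered
```

The header comes out right, but the body bytes are rewritten too, because
the stream itself carries no marker saying where text ends and binary data
begins.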

Maybe on these systems libcurl expects input to be in an ASCII-compatible
encoding and uses the ASCII versions of the symbols it needs to parse user
input ('.', '@', '/', etc.)? I.e. the conversion burden is on user code and
libcurl only takes steps to cancel the effects of a non-ASCII execution
character set within itself.
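
To make that guess concrete, here is a sketch of parsing with delimiters
given as explicit ASCII code points instead of native character literals.
It is a Python simulation (cp037 stands in for the EBCDIC code page, and
split_authority is a made-up helper), so it only illustrates the hypothesis
and says nothing about how libcurl is actually written.

```python
# Delimiters spelled as ASCII code points, so they do not depend on the
# execution character set (in C on an EBCDIC host, '/' would not be 0x2F).
ASCII_SLASH = 0x2F
ASCII_AT = 0x40

def split_authority(url_bytes):
    """Split b'user@host/path' into (user, host, path) using ASCII delimiters."""
    slash = url_bytes.find(bytes([ASCII_SLASH]))
    authority = url_bytes if slash < 0 else url_bytes[:slash]
    path = b"" if slash < 0 else url_bytes[slash:]
    at = authority.rfind(bytes([ASCII_AT]))
    user = b"" if at < 0 else authority[:at]
    host = authority if at < 0 else authority[at + 1:]
    return user, host, path

text = "user@host.example/index.html"
print(split_authority(text.encode("ascii")))  # caller converted to ASCII first
print(split_authority(text.encode("cp037")))  # native EBCDIC bytes passed in
```

With ASCII input the pieces come apart as expected. With native EBCDIC
bytes none of the delimiters are found, which is why in this model the
caller would have to do the conversion.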

Regards,
Michael.

Received on 2018-04-25