Re: encoding expectations

From: Michael Kilburn <crusader.mike_at_gmail.com>
Date: Thu, 19 Apr 2018 21:46:51 -0500

On Thu, 19 Apr 2018, Daniel Stenberg wrote:
> On Wed, 18 Apr 2018, Michael Kilburn wrote:
>> For example, it seems that CURLOPT_URL expects a string that is encoded
>> according to current C locale(?). And if user encoding doesn't support
>> certain symbol -- curl can't talk to a server who's name contains that
>> symbol.
>
> It's even worse than so. First, let me mention my recent talk about "the
> sorry state of URLs" from curl up 2018:
> https://curl.haxx.se/video/curlup-2018/2018-04-15_Daniel-Stenberg-urls.webm
>
> URLs, by curl's definition, and by the original RFC 3986 definition, are
> ASCII only. You can use IDN domain names (if curl was built with that
> feature enabled), and then curl will use the currently set locale when
> trying to convert it to puny code.
>
> Outside of the domain name, anything that isn't ASCII is not RFC 3986
> compliant. But curl uses a forgiving approach, as you might want to send
> rubbish to your server, so as long as curl can figure out the individual
> parts it will pass on what you passed in. So UTF-8 or any other encoding
> will be accepted and passed on.

I spent some time looking into the "encoding" subject, and holy crap, it is
a can of fat worms... I'll summarize what I've learned here and base my
suggestions on it (please correct me if I am wrong somewhere):

- a string is a sequence of bytes, and until you need to interpret its
content (for example, to split the server name out of a URL) the encoding
is unimportant
- there are multi-byte and single-byte encodings, and in APIs (e.g. the C
runtime) a string is by default considered to have a single-byte encoding
(unless a given function's spec specifically contradicts that) -- that
seems a fair default that most software follows
- there is such a thing as the "basic execution character set" (BECS) -- it
defines 96 symbols, as well as their numerical representations in the final
(compiled) executable. I.e. this dictates how your "abc" will be
represented in memory as char[4] during execution. The C standard defines
the symbols, but doesn't define their numeric values (with the exception of
the '\0' symbol -- its value is always 0). Note that '@' is not part of
BECS
- there is such a thing as the "execution character set" (ECS) -- which is
BECS plus everything else a given platform considers to be part of a given
encoding
- there is such a thing as a "native encoding" -- and this is the most
elusive thing I've seen in my life, as I couldn't find any definition of it
anywhere. But, apparently, if you compile something on a given platform and
don't specify the ECS explicitly -- whatever the compiler chose for you is
the "native encoding"
- there is the "locale" -- which (amongst other things) specifies the
"current" encoding for single-byte-encoded strings -- related C runtime
functions are supposed to honor it, and the same is expected from user
code. See setlocale() in the manpages.

- a program always starts in the locale named "C" (aka the POSIX locale)

- 'setlocale(LC_ALL, "")' will update the current locale to whatever the
environment suggests (via env variables on Linux/etc) -- i.e. the program
decides to honor the user's request to use a certain locale

- note that because BECS gets set during compilation while the locale is
set at runtime -- this means that all possible locales (that a given OS
provides) have to be binary compatible with any BECS (that the OS supports)
or your program will break, because the '\n' value in the code will differ
from the '\n' passed in

- this means the encoding provided by the "C" locale isn't guaranteed to be
US-ASCII compatible -- it merely has to be provided and has to be binary
compatible with BECS, and BECS doesn't specify char values

- same for all possible BECSes -- they either have to be compatible on the
binary level, or your OS should use other means to prevent you from mixing
incompatible BECSes in the same app (for example by refusing to load an
incompatible library)
- sidenote: I guess this also means you need to perform a separate
compilation of your app for each version of Windows (Japanese/etc) because
their BECSes differ
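
To make the compile-time vs runtime split concrete, a trivial sketch (plain
C, nothing libcurl-specific):

    #include <stdio.h>
    #include <locale.h>

    int main(void)
    {
        /* these numeric values were baked in at compile time
           (by the execution character set): */
        printf("'\\n' = %d, '@' = %d\n", '\n', '@');

        /* ...whereas the locale (and thus the encoding of runtime
           input) is only chosen here, at run time: */
        setlocale(LC_ALL, "");
        return 0;
    }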

I kinda used encoding and charset as synonyms here... In any case these
links provide very good info on this topic:
https://stackoverflow.com/questions/3768363/character-sets-not-clear
https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html
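
For example, on POSIX systems you can see what the "current" single-byte
encoding actually is -- a minimal sketch using nl_langinfo() (POSIX-only,
won't build as-is on Windows):

    #include <stdio.h>
    #include <locale.h>
    #include <langinfo.h>   /* POSIX */

    int main(void)
    {
        /* the program starts in the "C" locale */
        printf("startup codeset: %s\n", nl_langinfo(CODESET));

        /* switch to whatever the environment (LANG/LC_ALL/...) suggests */
        setlocale(LC_ALL, "");
        printf("environment codeset: %s\n", nl_langinfo(CODESET));
        return 0;
    }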

So, wrt the curl library, here is what I (as a user) would expect:
- any narrow string (i.e. char*) parameter/return value is expected to be
encoded using the current locale's single-byte charset (unless explicitly
specified otherwise)
- this means libcurl (or the libraries it uses) should perform the required
conversions as necessary, for example:

- if the HTTP protocol defines headers as US-ASCII-encoded strings --
libcurl has to convert from the current encoding to US-ASCII

- if the DNS protocol expects the bytes of US-ASCII-encoded strings on the
wire -- libcurl has to do the conversion

- same wrt returning values back to the user -- on-wire data has to be
converted into the current encoding

- this means the libcurl API should allow for encoding failures (see the
iconv sketch after this list)
- this means that if a given URL can't be represented in the current
encoding -- libcurl can't be used to access the given resource

- ...therefore it makes sense to introduce an encoding-specific API (e.g.
CURLOPT_URL_UTF8) which explicitly sets the encoding of the string being
passed

- ...note that UTF8 isn't a single-byte encoding, and therefore what Linux
did is kind of a hack -- it basically added a special case "if the encoding
is UTF8, then each char* is multi-byte-encoded", but since UTF8 is
US-ASCII-compatible -- it seems to work ok (only on systems with a
US-ASCII-compatible BECS :-) ). I might be wrong here -- I don't think I
understand the intricacies of using multi-byte encodings yet.

- also, if libcurl takes a string and simply forwards it to the server
without conversion (e.g. CURLOPT_HTTPHEADER) -- that has to be documented
in big red letters, because the current encoding could be incompatible with
what the server expects
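
To illustrate the "convert and allow for failure" behaviour I have in mind,
a minimal POSIX iconv sketch (the function name and error handling are
mine, purely for illustration):

    #include <string.h>
    #include <iconv.h>      /* POSIX */
    #include <langinfo.h>

    /* Convert a locale-encoded string to US-ASCII, failing (rather
       than mangling) when a character has no US-ASCII representation.
       Assumes setlocale(LC_ALL, "") was already called. */
    static int to_ascii(const char *in, char *out, size_t outsize)
    {
        iconv_t cd = iconv_open("US-ASCII", nl_langinfo(CODESET));
        if (cd == (iconv_t)-1)
            return -1;              /* conversion pair unsupported */

        char *inp = (char *)in;
        size_t inleft = strlen(in);
        size_t outleft = outsize - 1;
        size_t rc = iconv(cd, &inp, &inleft, &out, &outleft);
        iconv_close(cd);
        if (rc == (size_t)-1)
            return -1;              /* e.g. EILSEQ: not representable */
        *out = '\0';
        return 0;
    }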

Another option is to put the conversion burden on the user -- basically,
declare that each string passed to (or received from) libcurl is in a
specific predetermined encoding (US-ASCII/UTF8/etc). Then most of the
headaches described above become the user's, but you'll have to be careful
with functions provided by the runtime -- they expect data in the current
encoding, not necessarily the one you picked.

Sidenote: not sure about Windows -- each dll can have its own copy of the
CRT (via static linking). Does that mean each dll can have its own locale?
Then what to do with strings that cross dll boundaries?

Now, once we agree on the encoding of the strings being passed (the
single-byte "current" encoding or UTF8) -- it's time to decide what to do
with their content, i.e. how to handle URLs. I'd suggest following what the
wiki says here:
https://en.wikipedia.org/wiki/Internationalized_domain_name

basically:
- split out the servername (slash and dot are in BECS and the encoding is
single-byte (or utf8) -- so this is guaranteed to work)
- convert it to Unicode, then apply ToASCII() and feed the result to the
name resolution mechanism (e.g. gethostbyname/etc) -- sketched with libidn2
after this list
- if not built with WinIDN/libidn -- fail if the servername can't be
converted to US-ASCII

- ... add an optional override that uses the "raw" servername in this case,
and write HACK in big red letters around it (and declare it obsolete)

- ... add a note that CURLOPT_URL_UTF8's behaviour may differ from
CURLOPT_URL's in this case

- note that some symbols libcurl may use in its string interpretation
aren't part of BECS (e.g. @), and therefore it isn't guaranteed that your @
is the same as the @ passed by the user. You have to convert @ to the
current encoding or do the interpretation in a pre-defined encoding (utf8)
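
The ToASCII step, sketched with libidn2 (assuming its idn2_to_ascii_8z()
entry point -- check your idn2.h; link with -lidn2):

    #include <stdio.h>
    #include <stdlib.h>
    #include <idn2.h>

    int main(void)
    {
        char *ascii = NULL;
        /* ToASCII on a UTF-8 hostname ("räksmörgås.se" as raw bytes) */
        int rc = idn2_to_ascii_8z("r\xc3\xa4ksm\xc3\xb6rg\xc3\xa5s.se",
                                  &ascii, IDN2_NFC_INPUT);
        if (rc != IDN2_OK) {
            fprintf(stderr, "ToASCII failed: %s\n", idn2_strerror(rc));
            return 1;
        }
        printf("%s\n", ascii);      /* expect: xn--rksmrgs-5wao1o.se */
        free(ascii);
        return 0;
    }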

> It's even worse than so. First, let me mention my recent talk about "the
> sorry state of URLs" from curl up 2018:
> https://curl.haxx.se/video/curlup-2018/2018-04-15_Daniel-Stenberg-urls.webm

I watched it. Tbh I didn't see any particular problems besides political
games within/between standardization bodies. At least I didn't notice
anything that makes my problem worse.

> This, presumably, because your server end likes the encoding passed in on
> Linux but not the one used on Windows.

... or my Windows libcurl is built without IDN support. Is there any way to
check that?
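
(Answering myself a bit: if I read curl.h right, curl_version_info()
exposes feature bits, so something like this should tell me at runtime:)

    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        curl_version_info_data *info = curl_version_info(CURLVERSION_NOW);
        printf("IDN support: %s\n",
               (info->features & CURL_VERSION_IDN) ? "yes" : "no");
        return 0;
    }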

> Encode them yourself properly before you pass them to curl.

yes, I guess that is one way of doing it.
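
For the non-hostname parts, I suppose that means percent-encoding the raw
bytes myself before building the URL, e.g. with curl_easy_escape()
(assuming my input is already UTF-8; the hostname below is made up):

    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        /* percent-encode a UTF-8 path segment ("smörgås" as raw bytes)
           so that the string handed to CURLOPT_URL is plain ASCII */
        char *seg = curl_easy_escape(curl, "sm\xc3\xb6rg\xc3\xa5s", 0);
        if (seg) {
            char url[256];
            snprintf(url, sizeof(url), "http://example.com/%s", seg);
            curl_easy_setopt(curl, CURLOPT_URL, url);
            curl_free(seg);
        }
        curl_easy_cleanup(curl);
        return 0;
    }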

> I hope you don't try to send funny encodings in HTTP headers.

Occasionally I do... But then I end up running into stuff like this:
https://stackoverflow.com/questions/47687379/what-characters-are-allowed-in-http-header-values

... so I stopped doing it.

> I should mention this page: https://curl.haxx.se/docs/vuln-7.51.0.html

Sigh... This app doesn't go out to the internet, so I'll live with that
until the next update.

> I'm always interested in feedback, suggestions and ideas on how to
> improve libcurl and the subject of URLs and how to deal with them

What do you think about ideas laid out above?

In any case, I still need an answer to one of my original questions -- what
encoding does libcurl expect CURLOPT_URL to be in? Is it the "current"
encoding as specified by the current locale? US-ASCII? UTF8? Any encoding
that is binary-compatible with libcurl's BECS -- so that libcurl can split
it into "almost black-box" parts using '/', '.', etc.?

-- 
Sincerely yours,
Michael.
