Mention encoding in curl_easy_escape docs #1612

Closed
jeroen opened this issue Jun 24, 2017 · 7 comments

Comments

@jeroen
Contributor

jeroen commented Jun 24, 2017

The manual pages for curl_easy_escape and curl_easy_unescape should mention which character encoding is used for const char * url if we escape e.g. a Chinese word.

I assume it is UTF-8 which means that e.g. on Windows the user needs an additional call to iconv() to convert it to the native encoding. Currently this is not obvious.

@bagder
Member

bagder commented Jun 24, 2017

URIs are per definition (RFC 3986) ASCII only, so there's no "encoding" at all to speak of. How do you suggest we make this clearer?

@jeroen
Contributor Author

jeroen commented Jun 26, 2017

Perhaps I misunderstand it then. I thought URL-encoding was used to map non-ASCII arguments into ASCII ones. For example, to post a form:

> curl_escape("food=寿司")
 "food%3D%E5%AF%BF%E5%8F%B8"

> curl_unescape("food%3D%E5%AF%BF%E5%8F%B8")
 "food=寿司"

But this requires we know the character encoding of the output, right? Maybe I'm mistaken.
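
A minimal sketch of the question raised here, using Python's standard library as a stand-in for the libcurl functions (an assumption for illustration only; this is not libcurl code). Percent-decoding yields raw bytes, and turning those bytes back into text is a separate step that requires choosing a character encoding:

```python
from urllib.parse import unquote_to_bytes

escaped = "food%3D%E5%AF%BF%E5%8F%B8"

# Percent-decoding only turns each %XX back into one byte; no charset is involved.
raw = unquote_to_bytes(escaped)

# Interpreting the bytes as text is where the encoding has to be known.
as_utf8 = raw.decode("utf-8")      # the encoding the escaped string was produced with
as_latin1 = raw.decode("latin-1")  # a wrong guess: it decodes, but to different text

print(raw)
print(as_utf8 == as_latin1)
```

The escaped form is identical in both cases; only the interpretation of the decoded bytes differs.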

@bagder
Member

bagder commented Jun 29, 2017

Welcome to the mess of URLs. libcurl supports URLs as defined by RFC 3986 (with some "extensions"), while browsers (mostly) support the WHATWG URL spec. This is a reason for "interesting" differences and I've collected a few of them in a URL interop issues document.

A URL in libcurl cannot legally contain any 8-bit characters as that's not allowed by the spec! (The exception to this rule is the host name part, which libcurl will decode and handle.) But libcurl doesn't filter out 8-bit characters; it is liberal and will instead accept them and just pass them on as-is. libcurl assumes that you pass in a valid URL that you want to work with.

If you want to pass on "寿司" (or similar) in a URL you probably want to encode it using percent encoding - somehow. The libcurl escape/unescape functions will URL-encode/decode for you, but they both simply work on binary data and they have no knowledge or awareness of specific encodings.

@jeroen
Contributor Author

jeroen commented Jun 29, 2017

Perhaps it's useful to warn the user (especially on Windows) that the convention is UTF-8, and that the server might be expecting the URL-encoded text to be UTF-8. See also this discussion.

Note that browsers, even on Windows, always use UTF-8 when posting a form or using JavaScript. For example, the default output is:

  • encodeURIComponent("Malmö") (osx) -> "Malm%C3%B6"
  • encodeURIComponent("Malmö") (windows) -> "Malm%C3%B6"
  • curl_easy_escape() on osx (default locale): "Malm%C3%B6"
  • curl_easy_escape() on Windows (default locale): "Malm%F6"
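
The difference in the list above can be reproduced with a short sketch. It uses Python's `urllib.parse.quote` rather than libcurl, and it assumes that the Windows default locale in the example corresponds to the Windows-1252 code page (the thread does not say which code page was in use):

```python
from urllib.parse import quote

city = "Malm\u00f6"  # the string "Malmö", written with an escape for portability

# Same text, two byte encodings, two different percent-encoded results.
utf8_escaped = quote(city, safe="", encoding="utf-8")
cp1252_escaped = quote(city, safe="", encoding="cp1252")  # assumed Windows code page

print(utf8_escaped)    # Malm%C3%B6
print(cp1252_escaped)  # Malm%F6
```

The percent-encoding step is the same in both calls; the output differs only because the bytes handed to it differ.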

I understand that in C it is the programmer's responsibility to think about the encoding of a char* but it might still be helpful to add some words of caution from your reply above, to the docs.

@bagder
Member

bagder commented Jul 1, 2017

Again, browsers think the WHATWG URL Spec defines how URLs work, but that's not at all a universal law, so they are bound to function differently from all the world's URL-using software that is written to work with the IETF/W3C URI specs.

Note that in your four examples, the encoded versions (the ones on the right) are the URL formatted ones and the versions of the strings before encoding are just strings. Since libcurl works with URLs, it also assumes that the encoding is already done. The URL you set is the URL you want.

it might still be helpful to add some words of caution from your reply above, to the docs

Can you suggest any wording that you think might've helped you? I assume you mean that these words should be added to the CURLOPT_URL man page or would you have looked elsewhere to find this information?

@jeroen
Contributor Author

jeroen commented Jul 5, 2017

@bagder excuse my lack of understanding about this topic. My main concern is not so much CURLOPT_URL but rather CURLOPT_POSTFIELDS which also mentions:

You can use curl_easy_escape to url-encode your data, if necessary. It returns a pointer to an encoded string that can be passed as postdata.

I think some naive users (like me) might mistakenly assume that curl_escape will URL-encode the string in UTF-8 form, similar to JavaScript's encodeURIComponent().

I don't think this is a completely unreasonable expectation; the section about application/x-www-form-urlencoded-encoding in the HTML5 spec says:

  1. If the form element has an accept-charset attribute, let the selected character encoding be the result of picking an encoding for the form. Otherwise, if the form element has no accept-charset attribute, but the document's character encoding is an ASCII-compatible character encoding, then that is the selected character encoding. Otherwise, let the selected character encoding be UTF-8.

So servers will expect strings to be posted as url-encoded UTF-8 unless specifically requested otherwise by accept-charset. Perhaps the manual for CURLOPT_POSTFIELDS could mention:

You can use curl_easy_escape to url-encode your data, if necessary. It returns a pointer
to an encoded string that can be passed as postdata. Note that `url-encode` does not 
perform any character recoding. If the server expects UTF-8 data (the default in
HTML5 forms), Windows clients might need to convert strings to UTF-8 before url encoding.
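
A rough sketch of the conversion step this wording describes, again in Python rather than C. The helper name `escape_as_utf8` is hypothetical, and Windows-1252 is assumed as the native encoding; in a C program the transcoding would be done by something like iconv() before the call to curl_easy_escape:

```python
from urllib.parse import quote_from_bytes


def escape_as_utf8(native: bytes, native_encoding: str) -> str:
    """Transcode bytes from native_encoding to UTF-8, then percent-encode them."""
    utf8_bytes = native.decode(native_encoding).encode("utf-8")
    return quote_from_bytes(utf8_bytes, safe="")


native = b"Malm\xf6"  # "Malmö" as assumed Windows-1252 bytes

# Escaping the native bytes directly keeps the native encoding.
print(quote_from_bytes(native, safe=""))  # Malm%F6

# Transcoding first gives the UTF-8 form a server is likely to expect.
print(escape_as_utf8(native, "cp1252"))   # Malm%C3%B6
```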

This is just a suggestion, perhaps I am still misunderstanding the topic :D In that case feel free to close this issue.

@bagder bagder closed this as completed in a126ca8 Jul 7, 2017
@bagder
Member

bagder commented Jul 7, 2017

Thanks, I edited the curl_easy_escape man page and now it says this.

@lock lock bot locked as resolved and limited conversation to collaborators May 6, 2018