
curl can't open Unicode filenames in Windows #345

Closed · z0hm opened this issue Jul 14, 2015 · 26 comments

z0hm commented Jul 14, 2015

WinXP SP2, cURL 7.43.
cURL can't open a file for transmission when the file name contains characters from a code page other than the OS default. Does cURL not support Unicode?

z0hm changed the title from "cURL don't open files with umlauts in names" to "Unicode. cURL don't open files with umlauts in names." on Jul 14, 2015
@dfandrich (Contributor)

Can you give an example including logs?

jay commented Jul 14, 2015

I know I've heard of this issue before but I can't find it. For example a UTF-8 encoded batch file like this won't work:

chcp 65001
curl -F filedata=@И.txt http://website
curl: (26) couldn't open file "?.txt"

65001 is the UTF-8 code page and the И is UTF-8 encoded there. The same thing happens with the Cyrillic code page. In Process Monitor I can see that the error is "NAME INVALID", but I don't know what causes that error.


"19","ntdll.dll","NtCreateFile + 0x12","0x773b0112","C:\Windows\SysWOW64\ntdll.dll"
"20","KernelBase.dll","CreateFileW + 0x35e","0x758ac5fd","C:\Windows\SysWOW64\KernelBase.dll"
"21","kernel32.dll","CreateFileW + 0x4a","0x759c3f56","C:\Windows\SysWOW64\kernel32.dll"
"22","kernel32.dll","CreateFileA + 0x36","0x759c53b4","C:\Windows\SysWOW64\kernel32.dll"
"23","msvcrt.dll","clearerr_s + 0x75b","0x76a1a310","C:\Windows\SysWOW64\msvcrt.dll"
"24","msvcrt.dll","sopen_s + 0x79","0x76a1a789","C:\Windows\SysWOW64\msvcrt.dll"
"25","msvcrt.dll","sopen_s + 0x1b","0x76a1a72b","C:\Windows\SysWOW64\msvcrt.dll"
"26","msvcrt.dll","remove + 0x137","0x76a1a628","C:\Windows\SysWOW64\msvcrt.dll"
"27","msvcrt.dll","fsopen + 0x6a","0x76a1a6c1","C:\Windows\SysWOW64\msvcrt.dll"
"28","msvcrt.dll","fopen + 0x12","0x76a1b2d6","C:\Windows\SysWOW64\msvcrt.dll"

The CRT-specific locale or something might need to be changed, or maybe we should get the UTF-16 encoded version of the command line and work with that and _wfopen.
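
A minimal sketch of that UTF-16 idea, assuming GetCommandLineW()/CommandLineToArgvW() to fetch the wide arguments and _wfopen() to open the file; this is illustration only, not curl code:

#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW; link with Shell32 */
#include <stdio.h>

int main(void)
{
  int argc;
  /* Get the command line as UTF-16, independent of the console codepage */
  wchar_t **argv = CommandLineToArgvW(GetCommandLineW(), &argc);
  if(argv && argc > 1) {
    /* Open the file named by the first argument via the wide CRT call */
    FILE *fp = _wfopen(argv[1], L"rb");
    if(fp)
      fclose(fp);
    else
      fwprintf(stderr, L"couldn't open file \"%ls\"\n", argv[1]);
  }
  LocalFree(argv);
  return 0;
}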

jay changed the title from "Unicode. cURL don't open files with umlauts in names." to "curl can't open Unicode files in Windows" on Jul 14, 2015
jay changed the title from "curl can't open Unicode files in Windows" to "curl can't open Unicode filenames in Windows" on Jul 14, 2015
z0hm commented Jul 15, 2015

WinXP SP2 RUS (code page for non-Unicode apps: 1251).

Lua script in UTF-8, 1.lua:
---------------------------
local function fread(f)
  local h = io.open(f, "rb")
  if not h then return nil end
  local x = h:read("*all")
  h:close()
  return x
end
local s1 = fread("1.txt")
local s2 = fread("2.txt")
os.execute('curl ... -T "'..s1..'" ...')
os.execute('curl ... -T "'..s2..'" ...')
---------------------------

1.txt in utf8 (65001)
--------------
Read Me.txt

--------------

2.txt in utf8 (65001)
--------------
Første Pucambù.txt

--------------

Command line used to run it: lflua.exe 1.lua

lflua.exe -- the Lua 5.1 interpreter from the Unicode FAR3 file manager (http://www.farmanager.com/index.php?l=en)

Read Me.txt -- transfers fine
Første Pucambù.txt -- doesn't transfer; curl answers that it can't open the file

dbyron0 commented Jul 16, 2015

I agree that WinMain (or perhaps GetCommandLineW) and _wfopen look like the way to address this. If it's helpful to include the filename in logging/debug output, it may also mean changing the logic here: https://github.com/bagder/curl/blob/master/lib/curl_multibyte.c#L25 to get these functions whenever we build for Windows.

I imagine this is a big enough blob to bite off at once, but I am tempted to mention that if we want support for long (> ~260 characters) file names on Windows, _wfopen isn't sufficient and we need to drop down to CreateFile/ReadFile/WriteFile and deal with the joy of HANDLEs. See https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx#maxpath for some of the details.

@vszakats (Member)

Not only GetCommandLineW(), but all string (filename) interactions via the Windows API should be done using the "WIDE" variants of said API functions. For example, in the above case that would mean calling CreateFileW() instead of CreateFileA(). And because this one is reached via the C RTL function fopen(), _wfopen() may be used in this case. I would opt to go with direct API calls to avoid messing with C RTL codepage and compatibility issues altogether. Another question to consider here is what UNICODE encoding should be expected by libcurl API functions that accept or return strings. For portability and to retain ABI compatibility, UTF-8 would probably be best.

Another issue to tackle (if this is a concern) is how to stay compatible with Windows versions that do not natively support the UNICODE ("WIDE") API variants. These are Win95/Win98/WinME, and they will either need a special build that keeps using the non-WIDE ("ANSI") API variants, or the unicows.dll layer to make them work transparently with the WIDE ones. I'm not sure how to handle this when dealing with C RTL functions. Supporting these old versions certainly needs a C compiler that is also compatible with them; MinGW is, and MS Visual Studio 2005 or older are (plus most other 3rd-party C compilers).

Another potentially interesting note is that the WinCE OS only supports the "WIDE" APIs (with some minor exceptions probably not relevant in the context of libcurl). Such differences should be hidden by the C RTL layer, if used.
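
To illustrate the WIDE-API route with a UTF-8 boundary, a hedged sketch (the helper name open_utf8_path is made up for the example): convert the UTF-8 path with MultiByteToWideChar() and call CreateFileW() directly, bypassing the C RTL codepage handling entirely.

#include <windows.h>

/* Hypothetical helper: open a UTF-8 encoded path via the WIDE API */
static HANDLE open_utf8_path(const char *utf8_path)
{
  wchar_t wpath[MAX_PATH];
  /* UTF-8 -> UTF-16; -1 means the input is NUL-terminated */
  if(!MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wpath, MAX_PATH))
    return INVALID_HANDLE_VALUE;
  return CreateFileW(wpath, GENERIC_READ, FILE_SHARE_READ, NULL,
                     OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
}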

@DemiMarie

The Microsoft-recommended method of handling Unicode/ASCII compatibility is to

#define UNICODE 1
#define _UNICODE 1

in a header included by every source file. Then use the un-suffixed versions of the functions.

Other notes:

  • As mentioned above, never call fopen() -- use _wfopen(). This probably needs to go in a compatibility shim layer that abstracts over it.
  • Any write to a file that was not opened by libcurl must be treated as potentially being a Windows console handle. In that case the CRT functions must not be used, as they do not support Unicode: one must use WriteConsoleW() and do all buffering manually (see the sketch below). This is especially serious for the command-line curl utility.
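
A rough sketch of that console case, assuming the usual GetConsoleMode() check to distinguish a real console from a redirected handle:

#include <windows.h>

/* Write UTF-16 text to stdout: WriteConsoleW for a real console;
   a redirected handle would instead need conversion plus WriteFile */
static void write_utf16_to_stdout(const wchar_t *text, DWORD nchars)
{
  HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
  DWORD mode, written;
  if(GetConsoleMode(out, &mode))
    WriteConsoleW(out, text, nchars, &written, NULL);
  /* else: output is a file or pipe; convert to a byte encoding and
     use WriteFile (omitted here) */
}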

bagder commented Oct 5, 2015

I'll welcome a patch from someone, as long as it has been tested with a fair degree of success (on Windows).

mkllnk added a commit to git-ftp/git-ftp that referenced this issue Dec 13, 2015
The NUL byte is a unique separator. When using NUL, filenames don't need
to be escaped and we can handle all kinds of special characters in file
names.

That said, there is no Windows support for these characters at the
moment.

- `sort -u` thinks that 'a' equals 'ä' and therefore omits 'ä'.
- `curl` fails to open files with Unicode in their names curl/curl#345
jay commented Feb 11, 2016

I was kind of hoping someone would pick up on this, but it's about time it went into the TODO. To see how this might work I wrote a draft with the idea of converting the command-line arguments to UTF-8 (see the discussion in #637) so we can continue to pass the user input around as char *. When files are opened or stat'd, the names are converted from UTF-8 to UTF-16. There is no way to set the locale to UTF-8 so that fopen would interpret names as UTF-8, as far as I can tell.

Things like this work:

-v --output спасти.txt http://россия.net
-v -F filedata=@спасти.txt http://example.com/

In the first one the host is an IDN; it's converted to UTF-8 and, if a WinIDN build is used, later converted to punycode (xn--h1alffa9f.net).

It has some problems, though, like the whole URL now being UTF-8. Output to the screen is not UTF-8, so any command-line input that is echoed to the screen comes out wrong:

* Connection #0 to host Ñ?D_Ñ?Ñ?D,Ñ?.net left intact

The way I did it was sloppy, just enough to get it working and see how it might work. I'm using a global variable in the DLL, g_curl_tool_args_are_utf8, and I also have some duplicated code, like the fopen and stat wrappers. But it seems to work for filenames and hostnames.

The draft is here:
https://github.com/curl/curl/compare/master...jay:win-utf8-test?expand=1
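
For a flavor of what such wrappers look like, here is a hedged sketch of the stat side (not the draft's actual code; stat_utf8 is a made-up name and buffer handling is simplified):

#include <windows.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Sketch: treat the incoming path as UTF-8, convert to UTF-16,
   then call the wide CRT variant */
static int stat_utf8(const char *path, struct _stat *buf)
{
  wchar_t wpath[MAX_PATH];
  if(!MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, MAX_PATH))
    return -1;
  return _wstat(wpath, buf);
}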

jay commented Apr 6, 2016

Since we don't have time to work on this right now it's been added to KNOWN_BUGS. 9f740d3

@Karlson2k (Contributor)

I'm willing to implement a solution for curl.
But first the curl devs need to choose how they want to handle all this stuff in the curl tool and in libcurl.
Currently it is not very well documented how curl uses the strings it is given.
It seems that libcurl treats all strings as encoded in the "locale encoding", which is definitely not the best choice:

  • depending on the platform and settings, the encoding can be changed per thread, per process or system-wide, so it's not thread-safe
  • some applications change the locale on the fly to "C" and back (because they need a decimal point instead of a decimal comma, or need to change the case of US-ASCII-only symbols)
  • the locale encoding can be limited: you can't convert GBK/CP936/GB18030/BIG5 text to CP1251/CP866/KOI8-R and vice versa, so text will be lost

My suggestion:

  • for libcurl:
    • treat all given urls as UTF-8 encoded
    • treat all other text (usernames, passwords) as is and don't attempt to convert it
    • output all text from remote servers as is or converted to UTF-8
  • for curl tool:
    • convert all input urls to UTF-8
    • configure libcurl accordingly, or convert the output itself

Later, smarter processing can be added: autodetect the text encoding for web and other servers according to the standard (several levels of detection, in priority order: HTTP header, HTML header, direct detection from the first few bytes) and automatically convert GET and POST data to the required encoding (which must be the same as the encoding of the page with the HTML form).

Anyway, it must be documented how libcurl deals with encoding.

bagder commented Apr 6, 2016

Then let's take it step-by-step:

  1. Document how it works now. That's important since we cannot introduce behavior changes without very careful consideration.
  2. Then work on introducing something that can make the handling consistent between platforms.

@Karlson2k (Contributor)

@bagder OK, where and how can we start documenting the current behavior?

bagder commented Apr 6, 2016

I'd say probably in the curl_easy_setopt.3 and the curl.1 man pages. Perhaps a new "text encoding" section would be suitable. Or what do you think?

@Karlson2k (Contributor)

Currently, encoding in libcurl/curl is some kind of mess. If we document it in the curl documentation for end users, users may start modifying their programs and scripts to match the updated documentation.
I'd prefer to create some simple internal document (a GitHub issue, for example) to simplify making decisions, and then, based on the decision, create PRs with code and documentation updates.

bagder commented Apr 6, 2016

I'm fine with that too (and weirdly enough I don't think it is a "mess" ;-)

@Karlson2k (Contributor)

@bagder Yep, curl is the best. 😃 It just needs a little bit of improvement. 😉

andrewchernow commented Jun 6, 2017

I am using libcurl in a new project and came up with a solution that replaces all fopen/open calls with their wchar_t versions, for anyone looking for a quick and dirty hack.

Note that if using OpenSSL with curl, some file handling is done by OpenSSL, e.g. SSL_CTX_use_certificate_chain_file. However, OpenSSL properly uses _wopen. I don't know about the other SSL libraries that curl supports.

I created an fopen/open macro in curl_setup.h just after including io.h (also after stdio.h).

#define fopen _wfopen_hack
#define open _wopen_hack

__declspec( dllexport ) FILE *_wfopen_hack(const char *file, const char *mode);
__declspec( dllexport ) int _wopen_hack(const char *file, int oflags, ...);

I then added the implementation to lib/file.c (error checking excluded):

FILE *
_wfopen_hack(const char *file, const char *mode)
{
    wchar_t wfile[260];
    wchar_t wmode[32];

    MultiByteToWideChar(CP_UTF8, 0, file, -1, wfile, 260);
    MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 32);

    return _wfopen(wfile, wmode); /* was "mode"; the converted wide mode must be passed */
}

int
_wopen_hack(const char *file, int oflags, ...)
{
   wchar_t wfile[260];
   int mode = 0;

   if(oflags & _O_CREAT)
   {
      va_list ap;
      va_start(ap, oflags);
      mode = (int)va_arg(ap, int);
      va_end(ap);
   }

   MultiByteToWideChar(CP_UTF8, 0, file, -1, wfile, 260);

   return _wopen(wfile, oflags, mode);
}

I logged to a file within those functions to ensure they were being called. I also did a strings check for "open" and only found my _wfopen_hack and _wopen_hack symbols.

bagder commented Jun 6, 2017

Cool! So what's the downside with this approach? Or perhaps put differently: why aren't you suggesting this as a real pull request?

@andrewchernow

Good question. I guess I didn't find this solution all that elegant (in hindsight), although it does solve the issue. My solution only focused on filesystem stuff, and thus may not be a 100% fix for Unicode problems. In addition, it forces the API user onto the wide versions, versus giving them control to enable/disable them. I'm not sure whether that could break existing applications.

Ideas to make this more committable:

  • use an open/fopen option with setopt to set a callback for file opens.
  • Add a flag to curl_global_init to enable wide versions of open calls. (my favorite)
  • instead of a macro, actually change the call sites for open/fopen with a Curl_open or Curl_fopen.

Andrew

@andrewchernow

Oh yeah, the version I am actually using doesn't hard-code 260 for the path buffer size; that's MAX_PATH on Windows and is essentially a meaningless value, since Windows can support path lengths up to 32767. My version queries the conversion size with an additional call to MultiByteToWideChar and then allocates the buffer, roughly as sketched below.
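
That dynamically sized conversion could look roughly like this (a sketch; utf8_to_wide is a made-up helper name):

#include <windows.h>
#include <stdlib.h>

/* Ask MultiByteToWideChar for the required length first, then allocate;
   the caller must free() the result */
static wchar_t *utf8_to_wide(const char *utf8)
{
  wchar_t *wide;
  int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
  if(len <= 0)
    return NULL;
  wide = malloc(len * sizeof(wchar_t));
  if(wide && !MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len)) {
    free(wide);
    wide = NULL;
  }
  return wide;
}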

jay commented Jun 6, 2017

@andrewchernow thanks for taking a shot at it, but I don't think it can be done the way you propose. Unicode characters are converted to a local codepage (e.g. "ANSI") in Windows in a way that can be lossy. In other words, if you have some Russian Unicode and it's converted to American ANSI, then the actual glyphs, I'd posit, are lost, so when you convert back you're not guaranteed to get the same thing.

the version I am actually using doesn't hard code 260 for path buffer size; which is MAX_PATH on windows and is essentially a meaningless value

Have you ever tried to open files in a folder in Explorer at a depth greater than MAX_PATH? It is not at all a meaningless value. In my experience some Windows "W" (Unicode) API functions simply do not work correctly with paths longer than MAX_PATH (or some slightly larger number), despite what the documentation implies. For that reason it's better to use MAX_PATH. (edit: I'm going to back away from this a bit -- it's not "better" to use MAX_PATH, but in my opinion there's just not much advantage to supporting longer paths, though it's fine to do so and probably a good idea; maybe Windows 10 has better support. I just take issue with it being called a "meaningless" value.)

@andrewchernow

Unicode characters are converted to a local codepage (eg "ANSI") in Windows in a way that can be lossy

If you use ANSI functions to manage files... then yes. NTFS stores file names as UTF-16, not ANSI. Thus, if you start with UTF-8, convert it to UTF-16 and then use a wide function to access the file system, ANSI doesn't play a role. I'm not sure where you are suggesting ANSI is injected.

In my project, I was solving the issue of supplying certificate and key file paths. My application acquires them using non-ANSI functions, converts them to UTF-8 and passes them to setopt, like CURLOPT_PINNEDPUBLICKEY, which uses an fopen call within curl. Using my solution, that fopen call becomes a _wfopen call with a wide converted path. In this case, no ANSI conversion ever occurs.

Have you ever tried to open files in a folder in explorer at a depth greater than max path

Yes, I have. It's rather hilarious. However, your assertion that this means a >MAX_PATH path is somehow wrong or invalid is a bit misguided. That path most likely points to a valid object in the file system. Punting on a request to open such a path seems like a bug; what if it was a public key or a file to upload? However, having to prefix a >MAX_PATH path with \\?\ also seems like a bug/hack ;) Windows 10 has a way to enable long paths now; I don't think the prefix is needed on Win10.

Anyhow, the use cases I need are working: supplying keys/certs to setopt, uploading, downloading.... Part of why I said "may not be a 100% fix for unicode problems". I just thought it may be useful to someone else.
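
For reference, the extended-length form mentioned above looks like this (a minimal sketch; the path is made up, and this form requires the wide API with an absolute, backslash-separated path):

#include <windows.h>

int main(void)
{
  /* The \\?\ prefix bypasses the MAX_PATH limit for wide API calls */
  HANDLE h = CreateFileW(L"\\\\?\\C:\\some\\very\\deep\\path\\file.txt",
                         GENERIC_READ, FILE_SHARE_READ, NULL,
                         OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
  if(h != INVALID_HANDLE_VALUE)
    CloseHandle(h);
  return 0;
}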

@Karlson2k (Contributor)

For multiplatform projects, the most correct way is to use UTF-8 internally and convert (if needed) on input/output and for filesystem access.
For Win32-only projects, the most correct way is to use WCHAR/wstring internally everywhere.

jay commented Jun 6, 2017

In my project, I was solving the issue of supplying certificate and key file paths.

I see. I was thinking of filenames provided to the curl tool on the command line, which are encoded as ANSI as described in this issue (e.g. argv[1], argv[2] etc). Should we try to convert them back to Unicode via UTF-8, some information would be gone. I had proposed earlier in the thread converting from UTF-16 (GetCommandLineW) to UTF-8, but that had other problems because Microsoft's CRT doesn't work with UTF-8 as a locale.

However, your assertion that this means >MAX_PATH is somehow wrong or invalid, is a bit misguided. That path most likely points to a valid object in the file system.

Yes, I agree, my assertion was too strong. Shortly after I wrote it I edited it, but I suspect GitHub sent out the e-mail update before then.

@andrewchernow

command line, which are encoded as ANSI as described in this issue (eg argv[1], argv[2] etc)

Very true, I see what you are saying. You know you can change the codepage of the command prompt via chcp (change codepage) and set it to UTF-8 with chcp 65001. Older Windows versions have some display issues, but underneath the data is correct.

Should we try to convert them back to Unicode via UTF-8 some information would be gone

True, but I don't think this is curl's issue. If curl is given malformed UTF-8, then stuff will break. Garbage in, garbage out.

Rather than GetCommandLineW, I'd suggest using the CRT wmain function. It is just like main(), but from process startup the arguments and the environment are UTF-16. No ANSI conversion at all. Then you can call WideCharToMultiByte with CP_UTF8 on each wide argv[X] until the cows come home.
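
A self-contained sketch of that wmain() approach (it just prints the converted arguments; a real tool would hand them to its existing char * code paths):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int wmain(int argc, wchar_t *argv[])
{
  int i;
  for(i = 0; i < argc; i++) {
    /* Query the UTF-8 size, allocate, then convert */
    int len = WideCharToMultiByte(CP_UTF8, 0, argv[i], -1,
                                  NULL, 0, NULL, NULL);
    char *utf8 = malloc(len);
    if(utf8 && WideCharToMultiByte(CP_UTF8, 0, argv[i], -1,
                                   utf8, len, NULL, NULL))
      printf("argv[%d] as UTF-8: %s\n", i, utf8);
    free(utf8);
  }
  return 0;
}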

harvald commented Dec 14, 2017

Any progress on this? I have the same issue. Does anyone know of a workaround?

blattersturm added a commit to citizenfx/curl that referenced this issue Mar 16, 2018
lock bot locked as resolved and limited conversation to collaborators May 6, 2018