
multipart uploads can't upload files > 2.2GB (limitation of size of long in Windows) #425

Closed
thready opened this issue Sep 11, 2015 · 28 comments

thready commented Sep 11, 2015

The following code uses a long for the file size. Much larger files could be uploaded this way if lFileSize were changed from a long to a uint64.

curl_formadd(&formpost, &lastptr,
             CURLFORM_COPYNAME, "file",
             CURLFORM_FILENAME, strDestinationFileName,
             CURLFORM_STREAM, &transferItem,
             CURLFORM_CONTENTSLENGTH, lFileSize,  // this is a long --> should be uint64
             CURLFORM_CONTENTTYPE, "application/json",
             CURLFORM_END);
bagder added the HTTP label Sep 11, 2015
bagder (Member) commented Sep 11, 2015

This is correct, and unfortunate, but there isn't any quick fix available that I can think of. curl_httppost is a public struct and it uses a 'long' for this - we can't just change that without breaking a lot of existing apps. A sensible way to proceed here would be to consider adding a curl_formadd() replacement function with a slightly modified API.

thready (Author) commented Sep 11, 2015

Thanks bagder. We might do it as you suggested and if we have permission we might submit that back to the project for your approval.

thready (Author) commented Sep 16, 2015

Hi bagder - would you know where I could get a pre-created Visual Studio project that allows me to compile libcurl?

captain-caveman2k (Contributor) commented:
If you're planning on working from the repo, then take a look in the projects directory and run the generate batch file.

If you've downloaded the release archive or a daily snapshot then they are generated for you in the same directory for vc6 through to vc14.

thready (Author) commented Sep 16, 2015

Ah, thank you very much captain-caveman! (i love that guy)...

thready (Author) commented Sep 17, 2015

I'm curious- would a feature request be merited here? I know I'd love this in the distribution- and with the size of files going around these days, and with the immense popularity of the amazing libcurl- why wouldn't large files be supported this way?

bagder (Member) commented Sep 18, 2015

Call it feature-request, bug or issue, I don't think it changes anything. In order to fix this limitation we need to do changes to code and more importantly someone needs to do these changes. I think this is a problem worth fixing, but I'm not personally able to work on it in the nearest time frame.

thready (Author) commented Sep 18, 2015

I'm giving it a shot! :)

jay added a commit to jay/curl that referenced this issue Sep 19, 2015
- Change internal use of contentslength to curl_off_t.

- Expose as contentslength_large in curl_httppost.

Bug: curl#425
Reported-by: thready@users.noreply.github.com
jay (Member) commented Sep 19, 2015

I took a look at this shortly after you posted but I haven't had time to work on it. Here is as far as I got if you want to pick up:

https://github.com/bagder/curl/compare/master...jay:fix_large_multipart_uploads?expand=1

In the public struct curl_httppost I added a contentslength_large as type curl_off_t. We're still treating contentslength as long, but only publicly in that struct; internally (FormInfo and FormData) we now treat it as curl_off_t. When the internal struct is converted to the public struct in AddHttpPost I did that with an auto-detect, so anything <= LONG_MAX goes in contentslength and anything bigger goes in contentslength_large. The alternative would be to pass both around in the internal structs, which we may have to do because...

I wrote the changes based on a quick read of the code; it has not been tested, and when I compiled it I got a lot of warnings, mostly from memory functions that expect type size_t but are now being passed contentslength as a curl_off_t. So maybe contentslength_large should be handled separately internally and only be used in certain situations, though that seems confusing. I'm interested in whether the changes work for you despite these problems, though.

bagder (Member) commented Sep 20, 2015

Seems like a doable approach!

What about adding a new bit for the flags, HTTPPOST_LARGESUPPORT or similar (instead of checking for values larger than MAX_LONG), and if that is set we use the new struct entry that is a curl_off_t? Then we could switch our code to always produce that from now on for both CURLFORM_CONTENTSLENGTH and the new CURLFORM_CONTENTSLENGTH_LARGE. Could make the internals a little easier.

thready (Author) commented Sep 21, 2015

Hi Jay! Thanks a lot for your contribution on this! I will try your work today and let you know if it works for me... :)


thready (Author) commented Sep 21, 2015

Hi Jay,

Unfortunately I'm getting a send error "Failed sending data to the peer" (error 55) as a result of curl_easy_perform.

I also tried something myself - I seem to have got lucky with a change that I'm not sure is valid or not. Everywhere that contentslength was found, I changed the type directly to a "long long" and files were uploading for me. Do you think this might be a valid solution? I have my doubts that it is.

Thanks again,

Mike


bagder self-assigned this Oct 23, 2015
bagder (Member) commented Oct 23, 2015

I have a patch in progress for this. Stay tuned.

bagder (Member) commented Oct 24, 2015

My suggested take, including updated docs and a test case, is now in the formadd-largefiles branch.

bagder closed this as completed in ca5f934 Nov 2, 2015
captain-caveman2k (Contributor) commented:
Unfortunately we have some compilation warnings under 32-bit Windows builds and possibly 32-bit Linux as well.

We are passing a curl_off_t (which is 64-bits when the compiler supports native 64-bit types) to malloc which takes a size_t which is typically 32-bits in 32-bit land and 64-bits in 64-bit land.

We (I) can add a cast to avoid the compilation issues on lines 838, 843 and 844 in formdata.c (or better still use curlx_uztoso() from warnless.c), but my question is this...

Should we be using a size_t here? I appreciate the old limit of a long was rather limiting but I'm not sure if we should be using anything larger than a size_t when malloc() cannot support a memory request that large.

jay (Member) commented Nov 7, 2015

Unfortunately we have some compilation warnings under 32-bit Windows

Yup I see that. What if we do something precautionary at the beginning of the function, like this:

--- a/lib/formdata.c
+++ b/lib/formdata.c
@@ -834,14 +834,16 @@ static CURLcode AddFormData(struct FormData **formp,
     /* we make it easier for plain strings: */
     if(!length)
       length = strlen((char *)line);
+    else if(length >= (size_t)-1)
+      return CURLE_OUT_OF_MEMORY;

-    newform->line = malloc(length+1);
+    newform->line = malloc((size_t)length+1);
     if(!newform->line) {
       free(newform);
       return CURLE_OUT_OF_MEMORY;
     }
-    memcpy(newform->line, line, length);
-    newform->length = length;
+    memcpy(newform->line, line, (size_t)length);
+    newform->length = (size_t)length;
     newform->line[length]=0; /* zero terminate for easier debugging */
   }
   else

captain-caveman2k (Contributor) commented:
Technically we aren't out of memory - so isn't that a bit misleading?

Personally, I would use size_t for size based variables rather than an offset type variable such as curl_off_t.

This way the API is consistent with memory allocation functions and anyone using it will then know what the limits are.

As I mentioned, I believe this type is an offset type and as such is signed - so on a 32-bit platform where 64-bit native integers aren't supported, the file size is going to appear to be between -2GB and 2GB rather than 0 and 4GB so passing anything larger than 2GB will appear as minus something in these variables - even though the underlying malloc may succeed. In that respect, this introduces ambiguity.

Unfortunately I don't believe we ever introduced an unsigned version of curl_off_t, like curl_size_t or curl_length_t, for compilers that do support uint64_t while size_t is effectively a uint32_t :(

In summary I would rather see a "curl_size_t" type introduced that can hold the largest unsigned value the compiler supports, and use that instead of curl_off_t. Without a "curl_size_t" I would rather use size_t. I would then use that for the API, and size_t in the function calls that do the memory allocation and copying, so only the top-most function has to do the cast rather than every malloc() or memcpy() call.

jay (Member) commented Nov 8, 2015

Technically we aren't out of memory - so isn't that a bit misleading?

You're right, it is not technically correct; CURLE_BAD_FUNCTION_ARGUMENT would be better I guess. I don't like the curl_size_t name; I think size_t should be exactly whatever it is and not have a counterpart. Also I think any type other than curl_off_t is unnecessary for this.

The length we are passing has to be larger than size_t for AddFormData. Recall we now allow uploading content greater than what a size_t can hold (the basis of this issue), so in AddFormData we need to be able to add a length --which may in fact be larger than 2GB-- to the total size. The size_t check must be scoped to type <= FORM_CONTENT, when the content measured by length is in memory.

2nd draft I changed it to return CURLE_BAD_FUNCTION_ARGUMENT instead of CURLE_OUT_OF_MEMORY, and added checks for negative integers and overflow.
https://github.com/jay/curl/compare/master...jay:formdata_check_length?expand=1

bagder (Member) commented Nov 8, 2015

I can't see a good reason for a curl_size_t; let's avoid that as long as we can. The problem here is not really the size_t type; the problem is that we need to be able to hold a size that can be larger than a size_t, but such a large value should never be passed to malloc().

captain-caveman2k (Contributor) commented:
Also I think any type other than curl_off_t is unnecessary for this

I disagree. By using a curl_off_t, because it is signed, we are limiting the size on 64-bit platforms to half of what malloc() can handle. On such platforms a size_t is the equivalent of an unsigned long long or uint64_t, whilst curl_off_t is the same as a long long or int64_t :( Whilst one can argue that the maximum value of an int64_t is still pretty big, it doesn't allow the programmer to do what they can technically do with malloc() :(

The length we are passing has to be larger than size_t for AddFormData

I've not been involved with this issue, so may not have a complete understanding of the problem, and for which I apologise, but I thought the issue was that we could not pass anything larger than a long to AddFormData() which on Windows is a 32-bit word regardless of CPU type and on Linux is either a 32-bit word or a 64-bit word depending on CPU and compiler.

By changing this function to a size_t we are still fixing that issue - if I have understood everything correctly.

If we use a size_t the size can be 0 to 4,294,967,295 on 32-bit platforms and 0 to 18,446,744,073,709,551,615 on 64-bit.

If we use a curl_off_t then the max size is 2,147,483,647 with 32-bit compilers that don't support integral 64-bit types, such as older versions of Visual Studio for example and 9,223,372,036,854,775,807 on 32-bit compilers that do support these types and on 64-bit platforms. But, and this is where curl_off_t offers nothing over a size_t, on a 32-bit platform with a 32-bit compiler that supports long long, or int64_t we allow the value to be 9,223,372,036,854,775,807 but malloc() can't handle it :(

So my question here is why use anything other than size_t?

I can understand using curl_off_t in other areas of curl - as we want the smallest and largest signed number possible (such as an offset) but that isn't the case here - we want a size above zero as a length cannot be a negative number.

I don't really want to introduce a new type but if we want to use a type that represents the largest unsigned number possible then we need one - such as curl_size_t / curl_length_t. The name is irrelevant as I don't really mind what it is called - but it would be the equivalent to an unsigned long long or uint64_t where it is supported and a unsigned long or uint32_t where it isn't.

Maybe I am missing something but I really wouldn't want to repeat the mistakes of the past and use a signed type for APIs that only accept unsigned values.

jay (Member) commented Nov 9, 2015

Steve, even for posterity 9,223,372,036,854,775,807 is huge. We could introduce a new maximum unsigned type that's exposed to the user, but I'm not convinced by your argument that it's needed for this. There may very well be a good use for it somewhere else, I don't know. Thumbs down from me if you call it curl_size_t though.

So my question here is why use anything other than size_t?

User wants to upload 6GB of content but can't use CURLFORM_CONTENTSLENGTH type long if it can't represent that value. Now user can use CURLFORM_CONTENTLEN type curl_off_t which in most cases should be long long and, based on my read of acinclude and curlbuild, is at worst going to be a long in which case SOL.

If length is a size_t then we have a problem because our user's content length value (whichever one and whichever type) may not fit in a size_t. Take a look at https://github.com/bagder/curl/blob/curl-7_45_0/lib/formdata.c#L851 . If length is a size_t then we are limited by what we can add.

jay (Member) commented Nov 9, 2015

Now user can use CURLFORM_CONTENTLEN type curl_off_t which in most cases should be long long and, based on my read of acinclude and curlbuild, is at worst going to be a long in which case SOL.

I'm going to backtrack on being certain that at worst curl_off_t is going to be a long. It looks as though if long failed during testing (64/32/16) for curl_off_t, which I can't see how that could possibly happen, then it can be a 16 bit int apparently. I don't understand why it's written like that. https://github.com/bagder/curl/blob/curl-7_45_0/acinclude.m4#L2884-L2893

bagder (Member) commented Nov 9, 2015

I don't understand why it's written like that.

It puzzles me too. I also fail to see which platform would fail all other tests and then match the 16 bit one. A 16bit int on DOS for example would still match 'int' in the "32 bit" check afaics.

I say we can remove that 16bit check.

jay added a commit to jay/curl that referenced this issue Nov 9, 2015
Because it's illogical to check for a 16-bit curl_off_t.

Ref: curl#425 (comment)
jay (Member) commented Nov 9, 2015

A 16bit int on DOS for example would still match 'int' in the "32 bit" check afaics.

It looks as though the 32 bit check is checking if int is exactly size 4 and if it's not then it's going to continue on to the 16 bit check. But as far as I know at that point there's no long that's 16 bits. It shouldn't happen according to the standard. The only way I could possibly see config doing the 16-bit check is if long is larger than 8, and long long is either larger than 8 or doesn't exist and all the other types tested for 32 and 64 bit are not sized that way which seems really unlikely.

I say we can remove that 16bit check.

Landed in 2200bf6.

jay (Member) commented Nov 9, 2015

It's since occurred to me that if a void pointer is smaller than 4 bytes then there is no suitable size for curl_off_t; however, that would also have been true before the recent change I just made (assuming there is no sizeof(long) == 2), so I guess it's never come up.

bagder (Member) commented Nov 9, 2015

Surely it'll be a while until we get a problem with curl_off_t being signed so that we cannot use that to represent a whole 64bit memory space for a malloc().

I would say that the API quirk we have to overcome here is that we use that single size (now CURLFORM_CONTENTLEN) for both a size in memory as well as size on disk, depending on the presence of other options. So we must read it as curl_off_t and then convert it to size_t later if it needs to be made as a malloc.
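bagder's rule of thumb maps to user-facing code along these lines once the wide option exists. This is a sketch only: it reuses CURLFORM_CONTENTLEN from this discussion, and transferItem, the filename, and the content type are placeholders, not a complete program:

```c
/* Sketch: supply the content length as a curl_off_t via the wide option
   discussed above, rather than the legacy long-based CURLFORM_CONTENTSLENGTH.
   Assumes a libcurl with the new option; not self-contained. */
struct curl_httppost *formpost = NULL;
struct curl_httppost *lastptr = NULL;
curl_off_t filesize = (curl_off_t)6 * 1024 * 1024 * 1024;  /* a 6GB file */

curl_formadd(&formpost, &lastptr,
             CURLFORM_COPYNAME, "file",
             CURLFORM_FILENAME, "big.bin",
             CURLFORM_STREAM, &transferItem,
             CURLFORM_CONTENTLEN, filesize,   /* read as curl_off_t */
             CURLFORM_CONTENTTYPE, "application/octet-stream",
             CURLFORM_END);
```

Internally the value would then only be narrowed to size_t at the point where an in-memory allocation is actually made.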

jay (Member) commented Nov 18, 2015

we must read it as curl_off_t and then convert it to size_t later if it needs to be made as a malloc.

@bagder That is what I did in formdata_check_length, so if you want to see something like that added please give me your feedback on the changes. The overflow check I don't like: we can't assume it will wrap, or that the undefined behavior will result in no change. The right way to handle it would be to add min and max macros and test against those, like if(foo > CURL_OFF_T_MAX - *size) /* overflow */. Can we always assume 9223372036854775807 if the size is 8? I'm concerned specifically about these compilers that use i64 and similar, since C89 doesn't guarantee 9223372036854775807. Also, is curl only two's complement, and does curl function with padding bits or trap representations? I wonder if something like this would be acceptable:

#if (CURL_SIZEOF_CURL_OFF_T == 4)
#define CURL_OFF_T_MAX CURL_OFF_T_C(2147483647)
#elif (CURL_SIZEOF_CURL_OFF_T == 8)
#define CURL_OFF_T_MAX CURL_OFF_T_C(9223372036854775807)
#endif
#define CURL_OFF_T_MIN (-CURL_OFF_T_MAX-1)

jay added a commit that referenced this issue Dec 7, 2015
- If the size of the length type (curl_off_t) is greater than the size
of the size_t type then check before allocating memory to make sure the
value of length will fit in a size_t without overflow. If it doesn't
then return CURLE_BAD_FUNCTION_ARGUMENT.

Bug: #425 (comment)
Reported-by: Steve Holme
jay (Member) commented Dec 7, 2015

I pushed a partial patch for this at 738b0ba that will make sure we don't overflow when casting (size_t)length. It's partial because I sidestepped the issue of the later overflow addition checks and left them out. Note this patch depends on size_t being unsigned, as it did in my drafts. If that can't be guaranteed then we'll need a different way to get the maximum size_t. I've always done (size_t)-1 to get the size_t max without issue, so that's the way I did it. It's compliant as long as size_t is unsigned.

lock bot locked as resolved and limited conversation to collaborators May 7, 2018

4 participants