
handling URL with http:/// (3 slashes between protocol and domain) #791

Closed
cjbern opened this issue May 5, 2016 · 13 comments


cjbern commented May 5, 2016

I did this

The URL below points to a live web server that returns a malformed Location field in its 301 HTTP response.

curl -I -L "http://bozardford.net"

I expected the following

Successful redirect to http://www.bozardford.net

despite the extra slash between the protocol and the domain name in the Location field of the 301 response header: "http:///www.bozardford.net"

Firefox 45 and Chrome 49 both handle the malformed Location field by ignoring the extra slash and redirecting as intended.

What I got

HTTP/1.1 301 Moved Permanently
Server: Apache-Coyote/1.1
Location: http:///www.bozardford.net
Connection: close
Content-Length: 0
Date: Thu, 05 May 2016 02:10:29 GMT
X-DDC-Arch-Trace: ,HttpResponse

curl: (6) Could not resolve host: http

curl/libcurl version

curl 7.35.0 (x86_64-pc-linux-gnu) libcurl/7.35.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smtp smtps telnet tftp 
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP

Also happens in pycurl:

'PycURL/7.19.5.3 libcurl/7.35.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3'

I haven't had time to build a more recent libcurl to test this against, but I haven't found any previous mention of this problem in the GitHub issues or in a web search.

operating system

Ubuntu 14.04 LTS

% uname -a
Linux 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:06 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

As I said above, current browsers handle Location fields malformed in this way in a manner that matches the obvious intent. It would be nice for curl to do this as well, instead of my having to build specialized redirection handling into my own code.

jay (Member) commented May 5, 2016

I can confirm that both Firefox and Chrome will skip seemingly any number of slashes after the scheme. Is that some legacy behavior? I don't think it's correct to do that.

cjbern (Author) commented May 5, 2016

I agree that it isn't correct, and RFC 7230 now makes a missing host in an absolute http URI explicitly invalid (a sender MUST NOT generate one; a recipient MUST treat it as invalid). Earlier RFCs apparently never made that explicit or clear enough.

Worse, this comment implies that Firefox used to reject HTTP URLs with the wrong number of slashes after the scheme: http://superuser.com/questions/352133/why-do-file-urls-start-with-3-slashes#comment388378_352134

Even worse: this autocorrection appears intentional in Chrome and is wontfix:
https://bugs.chromium.org/p/chromium/issues/detail?id=385645
https://blog.it-securityguard.com/google-chrome-security-multiple-leading-slashes-in-urls-may-confuse-some-server-side-xss-filters/

Having researched the above, I don't have a strong opinion on whether you should implement the same autocorrection in curl. As I said, it would be nice and would save me some time, but I can code a workaround.

bagder added the HTTP label May 5, 2016
bagder (Member) commented May 5, 2016

First, let's not drag file:/// into the confusion: there, three slashes are the correct form. http:// is different; an http:// URI cannot work without a host name.

RFC 3986 dictates how URIs work and there's no room for three slashes for HTTP.

But sure, Chrome and Firefox most probably do this for "web compatibility", as we like to call it, meaning that enough other clients break the specs to make you want to do so as well, since otherwise users get upset and think your product is flawed. It is similar to handling spaces in Location: headers, which we already do for exactly that reason.

So I would not be against a patch that makes curl act like the popular browsers in this regard. After all, people use curl to mimic browsers to a large extent, and not acting like browsers here means curl fails to deliver on that promise for these users.

bagder (Member) commented May 5, 2016

Firefox does in fact accept one or more slashes for HTTP and HTTPS redirects: I just tested redirects with one slash and with 10 slashes in Firefox, and they all redirect fine.


tomuta commented May 6, 2016

On that note, it would be nice to have a CURLOPT_REDIRECTFUNCTION option so that an application could easily implement its own behavior by rewriting the redirection URL or rejecting it (causing the transfer to fail). With that in place, one could support such malformed URLs.

jay (Member) commented May 7, 2016

@tomuta You can use CURLINFO_REDIRECT_URL to manually redirect.

#define MAXREDIRS  50
  /* assumes an initialized easy handle; CURLOPT_FOLLOWLOCATION must be
     left disabled so that CURLINFO_REDIRECT_URL is filled in on redirects */
  CURLcode res;
  int redir_count;
  for(redir_count = 0; redir_count < MAXREDIRS; ++redir_count) {
    char *url = NULL;
    res = curl_easy_perform(curl);
    if(res || curl_easy_getinfo(curl, CURLINFO_REDIRECT_URL, &url) || !url)
      break;
    /* redirect needed. this is where you could make a copy of the url and modify that */
    curl_easy_setopt(curl, CURLOPT_URL, url);
  }
  if(redir_count == MAXREDIRS) {
    fprintf(stderr, "\nError: Maximum (%d) redirects followed\n", MAXREDIRS);
  }

bagder (Member) commented May 8, 2016

To allow one, two or three slashes, something like this could be applied:

diff --git a/lib/url.c b/lib/url.c
index 70ccd0f..f07dd39 100644
--- a/lib/url.c
+++ b/lib/url.c
@@ -4133,16 +4133,22 @@ static CURLcode parseurlandfillconn(struct SessionHandle *data,

     protop = "file"; /* protocol string */
   }
   else {
     /* clear path */
+    char slashbuf[4];
     path[0]=0;

-    if(2 > sscanf(data->change.url,
-                   "%15[^\n:]://%[^\n/?]%[^\n]",
-                   protobuf,
-                   conn->host.name, path)) {
+    rc = sscanf(data->change.url,
+                "%15[^\n:]:%3[/]%[^\n/?]%[^\n]",
+                protobuf, slashbuf, conn->host.name, path);
+    if(2 == rc) {
+      failf(data, "Bad URL");
+      return CURLE_URL_MALFORMAT;
+    }
+    if(3 > rc) {

       /*
        * The URL was badly formatted, let's try the browser-style _without_
        * protocol specified like 'http://'.
        */

bagder (Member) commented May 9, 2016

Apparently browsers support any number of slashes. They do that because their spec says so, and the spec says so because they do that.

whatwg/url#118

bagder changed the title from "curl handling of HTTP 301 redirection fails when response location header starts with http:///<domain> (3 slashes between protocol and domain)" to "handling URL with http:/// (3 slashes between protocol and domain)" May 16, 2016
bagder self-assigned this May 17, 2016
bagder (Member) commented May 17, 2016

The rant

curl has actually never been very strict or particular with its URL parsing. I mean, it even accepts URLs on the command line with the "scheme://" part completely left out, which has never been considered a valid URL by anyone. It also parses only the bare minimum needed to do its job, which means it will accept other sorts of illegal URLs as well if you want it to.

I've given this a lot of thought, and I've discussed the WHATWG-URL "standard" widely and intensely over the last couple of days.

I think it would be a tactical mistake to give up completely and accept the WHATWG-URL as a standard. They run and write their "standard" as they see fit, for browsers only, without proper concern for the entire web and its ecosystem. That said, hopefully they will come around at some point and we can work on converting their document into a "real" standard. That would be a huge benefit for the web.

Still, I think we need to be realists and adapt to the world around us: when the WHATWG clearly says these URLs are fine and a huge portion of browsers accept them, it forces us to act. Sure, we could say they're not RFC 3986 compliant and refuse to work with them, but who would be happy with that in the long run? I don't think we in the curl project have enough power for such a stance to have any effect on the servers and providers that send back broken URLs in headers. They would just curse curl and continue to successfully use browsers against said servers.

The intent

I intend to merge a patch similar to the one I described above, after the curl 7.49.0 release, to give us time to test it out and get a feel for it. It will accept only one, two or three slashes, it will complain in the verbose output about anything that isn't exactly two, and it will rewrite the URL internally to the correct form so that extracting the URL or passing it on to proxies etc. still uses the correct format.

jay (Member) commented May 17, 2016

How is /// any more likely than //// ? They both seem really, really unlikely. Is there some common server configuration error that causes the former?

bagder (Member) commented May 17, 2016

Very anecdotal "evidence" only, so it's more of a hunch or a guess.

We've seen the former (in this bug report) and not the latter. When I've complained to whatwg people, some of them have hinted that URLs "like this" (it's unclear how many slashes that implies) are found to at least a measurable extent. Finally, I'm simply guessing that fewer slashes are more likely than more: /// as a typo is more common than //// because the first is a single-character mistake and the second is twice as many.

If we at a later point reconsider and have a reason to start accepting more slashes, then there's nothing preventing us from revisiting this topic.

bch (Contributor) commented May 17, 2016

On 5/17/16, Daniel Stenberg notifications@github.com wrote:

> Very anecdotal "evidence" only so more of a hunch or a guess.
>
> We've seen the former (in this bug report) and not the latter. When I've
> complained to whatwg people some of them have hinted that URLs "like this"
> (unclear how many slashes that imply) are being found to at least a
> measurable extent and finally I'm just guessing that fewer slashes are
> more likely than more. Like /// as a typo is more common than //// just
> because the first is a single letter mistake and the second means twice as
> many mistakes.

I could see "///*" coming from misconfigured CMSs or otherwise auto-generated code. I think somebody earlier wondered whether Mozilla had telemetry on this. Is there data, or just educated guesses and anecdotes?

> If we at a later point reconsider and have a reason to start accepting more
> slashes, then there's nothing preventing us from revisiting this topic.



bagder (Member) commented May 19, 2016

> Is there data, or just educated guesses and anecdotes?

There has been no data provided in this discussion, just random people making up random statements, me included. I've mentioned that it would be possible to add a counter in Firefox or similar, but (A) I'm not sure it would be accepted by the maintainers of that code, (B) I don't feel like writing it, and (C) I fear that whatever number came out of it wouldn't make a difference in the end.

bagder closed this as completed in 5409e1d May 30, 2016
lock bot locked as resolved and limited conversation to collaborators May 7, 2018