
hsts: add support for Strict-Transport-Security #5896

Closed
wants to merge 4 commits

Conversation

bagder
Member

@bagder bagder commented Aug 31, 2020

  • enable in the build (configure)
  • header parsing
  • host name lookup
  • unit tests for the above
  • CURL_VERSION_HSTS
  • CI build
  • curl_version_info support
  • curl -V output
  • curl-config --features
  • libcurl API to enable HSTS (CURLOPT_HSTS_CTRL)
  • command line support for --hsts
  • --hsts docs
  • CURLOPT_HSTS_CTRL docs
  • save cache to disk (and tested in unit test)
  • load cache from disk (tested in unit test)
  • API to set cache file name (CURLOPT_HSTS)
  • CURLOPT_HSTS docs
  • docs/HSTS.md
  • --version docs updated
  • adjusted curl_easy_duphandle accordingly
  • HSTS read/write callbacks
  • man pages for the 4 new setopts for callbacks
  • libtest to verify reading/writing HSTS cache with callbacks (1915)

Initial thoughts on API and usage: https://github.com/curl/curl/wiki/HSTS

Targeted to get merged for 7.74.0
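For orientation, here is a minimal sketch of how an application would enable the feature through the options listed above, using a file-backed cache (the `hsts.txt` file name is only an example):

```c
#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    /* switch on the HSTS engine for this handle */
    curl_easy_setopt(curl, CURLOPT_HSTS_CTRL, (long)CURLHSTS_ENABLE);

    /* load the cache from this file; the updated cache is written back
       when the handle is cleaned up */
    curl_easy_setopt(curl, CURLOPT_HSTS, "hsts.txt");

    /* a plain http:// request is upgraded to https:// for cached hosts */
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
    curl_easy_perform(curl);

    curl_easy_cleanup(curl);
  }
  return 0;
}
```

The command line equivalent is roughly `curl --hsts hsts.txt http://example.com/`, and `curl -V` lists HSTS among the features when the build has it enabled.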

@bagder bagder added the HTTP and feature-window (a merge of this requires an open feature window) labels Aug 31, 2020
@bagder bagder force-pushed the bagder/hsts branch 9 times, most recently from ecd7b61 to b4fe60e on September 7, 2020 06:14
@JCMais
Contributor

JCMais commented Sep 11, 2020

I just found the email talking about this feature on the mailing list and also had a quick look into the wiki.

About this:

ability to save to something else than a file

It would be incredible if this were supported from the get-go. We could probably have functions, just like READFUNCTION/WRITEFUNCTION, that are responsible for loading/saving the HSTS cache. These would default to loading/writing the file provided with CURLOPT_HSTS.

@bagder
Member Author

bagder commented Sep 12, 2020

ok, let me write up a proposal for how it could work and run it by you before I start any code on that...

@bagder
Member Author

bagder commented Sep 12, 2020

@JCMais
Contributor

JCMais commented Sep 12, 2020

This is great @bagder, thanks for taking the time to look into it.

In what scenario would CURLOPT_HSTSWRITEFUNCTION return CURLSTS_AGAIN/CURLSTS_DONE? Since it's libcurl that passes the data to be copied, the callback itself has no way of knowing whether it is handling the last item.

@bagder
Member Author

bagder commented Sep 13, 2020

CURLSTS_DONE would be a way for the callback to say "stop, don't give me any more names" before all names have been passed to it: basically stopping the write without it being a major error. I'm not sure the distinction is actually important to have... So the callback would have to return CURLSTS_AGAIN to get called with the next name.

Should the write callback get signaled somehow when the last entry is passed to it?

@JCMais
Contributor

JCMais commented Sep 13, 2020

I see, makes sense.

Should the write callback get signaled somehow when the last entry is passed to it?

Yes, I think this would be great. Example: closing the stream that the data is being written to.

If possible, the callback could receive the index of the item currently being processed and the total number of items, or any other data structure that conveys the same information.

@bagder
Member Author

bagder commented Sep 13, 2020

I added an index struct to the write callback, and I noticed I had forgotten to include the expiry date in the hstsentry struct.
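For reference, these are roughly the two structs involved after this change (names and fields as they appear in the public curl.h this work ends up with; details were still in flux at this point):

```c
/* one HSTS cache entry, as passed between libcurl and the callbacks */
struct curl_hstsentry {
  char *name;                        /* host name */
  size_t namelen;                    /* size of the name buffer */
  unsigned int includeSubDomains:1;  /* entry also covers subdomains */
  char expire[18];                   /* expiry date, "YYYYMMDD HH:MM:SS" */
};

/* progress information passed to the write callback */
struct curl_index {
  size_t index;   /* zero-based index of the entry being passed */
  size_t total;   /* total number of entries that will be passed */
};
```

With `total` known, the write callback can detect the last entry (`index == total - 1`) and, for example, close the stream it is writing to at that point.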

@bagder
Member Author

bagder commented Sep 14, 2020

A first shot at read/write callbacks is now provided in this PR as well. The wiki description of the functions has been updated to match the code (and there are initial man pages in the branch). This still lacks test cases for the callbacks. I'd appreciate any and all feedback on this.
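As a sketch of how an application might use the callbacks as drafted (option and return-code names per the man pages in the branch and the API that eventually shipped; treat the details as approximate at this stage):

```c
#include <stdio.h>
#include <string.h>
#include <curl/curl.h>

/* example application-side store: a fixed list of preloaded hosts */
static const char *preload[] = { "example.com", "example.org" };

/* read callback: called repeatedly; each CURLSTS_OK hands libcurl one
   entry for its in-memory cache, CURLSTS_DONE ends the loading */
static CURLSTScode hstsread(CURL *easy, struct curl_hstsentry *sts,
                            void *userp)
{
  size_t *i = (size_t *)userp;
  (void)easy;
  if(*i >= sizeof(preload) / sizeof(preload[0]))
    return CURLSTS_DONE;
  strncpy(sts->name, preload[*i], sts->namelen - 1);
  sts->name[sts->namelen - 1] = '\0';
  sts->includeSubDomains = 1;
  (*i)++;
  return CURLSTS_OK;
}

/* write callback: called once per entry when libcurl saves the cache */
static CURLSTScode hstswrite(CURL *easy, struct curl_hstsentry *sts,
                             struct curl_index *count, void *userp)
{
  (void)easy;
  (void)userp;
  printf("%zu/%zu: %s (subdomains: %u)\n", count->index + 1, count->total,
         sts->name, (unsigned int)sts->includeSubDomains);
  return CURLSTS_OK;
}

int main(void)
{
  size_t readindex = 0;
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_HSTS_CTRL, (long)CURLHSTS_ENABLE);
    curl_easy_setopt(curl, CURLOPT_HSTSREADFUNCTION, hstsread);
    curl_easy_setopt(curl, CURLOPT_HSTSREADDATA, &readindex);
    curl_easy_setopt(curl, CURLOPT_HSTSWRITEFUNCTION, hstswrite);
    curl_easy_setopt(curl, CURLOPT_HSTSWRITEDATA, NULL);
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
    curl_easy_perform(curl);
    /* the cache is handed to the write callback when the handle is
       torn down */
    curl_easy_cleanup(curl);
  }
  return 0;
}
```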

Contributor

@JCMais JCMais left a comment


just some minor comments

docs/HSTS.md (resolved)
docs/libcurl/opts/CURLOPT_HSTSREADFUNCTION.3 (outdated, resolved)
docs/libcurl/opts/CURLOPT_HSTSWRITEFUNCTION.3 (outdated, resolved)
@ghost

ghost commented Oct 1, 2020

DeepCode's analysis on #1c65de found 1 minor issue:

  • Using strcpy can lead to severe buffer overflow vulnerabilities. Use the safe alternative strncpy instead.


vszakats added a commit to curl/curl-for-win that referenced this pull request Oct 15, 2020
According to the draft implementation in curl, an external
library will not be needed.

Ref: curl/curl#5896
@bagder bagder marked this pull request as ready for review October 20, 2020 21:30
@bagder bagder removed the feature-window (a merge of this requires an open feature window) label Oct 21, 2020
@tlhackque

I took a quick look at this proposal. It appears to be based on the idea that the HSTS cache is loaded and saved sequentially from a flat file at library initialization and teardown. I'm not sure this scales, especially if HSTS finally gains widespread traction.

Questions:

  • What sets the maximum size of the HSTS cache? Should that be an option? What would happen if there were 100M HSTS-enabled hosts? [1] Does the cache lookup hold up (enough hash buckets, or a linear search)?
  • What if the backing store were a database? Shouldn't the read callback include the host being accessed? (Assuming a lazy init.)
    This would enable a simple SELECT from a database, or provide a hint to a flat file about what to return. How would a store operation write back only the dirty (modified) records? (It's likely that there aren't any, or at least not many; HSTS entries are long-lived.)
  • For many applications (e.g. the curl command line), it seems likely that you only need the records for one or a few hosts. Does it make sense to load the entire cache for this? OTOH, a persistent library user (e.g. a server) may benefit from a large preload.
  • What influences the caching algorithm? LRU makes sense for random user interactions. But if you have an application that scans a large number of sites in the same order every time, a preload algorithm would work (and LRU could be counter-productive).

I'm not opposed to a simple flat-file default, but you should consider the alternatives. (A database doesn't have to be massive; SQLite, BerkeleyDB, or even a purpose-built backend might be used.)

In any case, I suggest thinking about scalability and the (implicit) use models. I think that, at a minimum, the API should support a database back end, and that the cache load should happen when a cache miss happens, not in library init. This also includes providing a hostname in an eviction callback, not just "write whatever is left in the cache" at library teardown.
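Purely as an illustration of that "ask on a miss" idea (nothing like this exists in the PR; the lookup function, table layout and column names below are hypothetical), the per-host query could boil down to a single indexed SELECT:

```c
#include <string.h>
#include <sqlite3.h>

/* hypothetical: answer a single host lookup from a database instead of
   preloading the whole cache; returns 1 if the host has an HSTS entry */
static int hsts_lookup(sqlite3 *db, const char *host,
                       char *expire, size_t expirelen, int *subdomains)
{
  sqlite3_stmt *stmt = NULL;
  int found = 0;
  if(sqlite3_prepare_v2(db,
       "SELECT expire, include_subdomains FROM hsts WHERE host = ?1",
       -1, &stmt, NULL) != SQLITE_OK)
    return 0;
  sqlite3_bind_text(stmt, 1, host, -1, SQLITE_STATIC);
  if(sqlite3_step(stmt) == SQLITE_ROW) {
    const unsigned char *e = sqlite3_column_text(stmt, 0);
    if(e) {
      strncpy(expire, (const char *)e, expirelen - 1);
      expire[expirelen - 1] = '\0';
    }
    *subdomains = sqlite3_column_int(stmt, 1);
    found = 1;
  }
  sqlite3_finalize(stmt);
  return found;
}
```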

Sorry if I'm missing something - I saw this pop up, but am too time-limited to dive deeply into the code.

Nonetheless, I hope these comments are useful.

[1] Yes, that's a very large number to make the point. Consider a library user that's a search engine. Or even 10M or 1M.

@bagder
Member Author

bagder commented Oct 21, 2020

Thanks for the good feedback and excellent questions. I think this also shows that it is a good idea to ship this feature marked as experimental to start with, so users can try things out as they are now and we can polish and improve them according to what users want and experience.

the HSTS cache is loaded and saved sequentially from a flat file at library initialization and teardown

If you use that feature, yes. A library user can deal with it however it likes.

I'm positive most applications using libcurl will not populate the cache with millions of entries even if/when HSTS is used very widely, since very few applications work with that number of random sites in a browser-like fashion.

If we want to build the system to actually have a huge hsts database, then yes we'd need something different. The current design is made with a small dataset in mind.

What sets the maximum size of the HSTS cache?

Nothing.

Should that be an option?

Perhaps. But I'd rather see an actual use for this option before we go ahead and add more bells and whistles.

Does the cache lookup work (enough hash buckets? linear search)?

There's nothing in either the file format or the callback API that says anything about how the search is to be done. If someone experiences slow HSTS handling, I don't think we will object to performance enhancements. I don't think we need to optimize for such a hypothetical future case now; we can take it as it comes.

What if the backing store was a database?

Then you're probably asking for an API that lets the application maintain the database and do the entire lookup (because I can't see a future where libcurl bundles some sort of database). That's not provided here, but I wouldn't be against it if someone actually provides a viable use case where this is necessary or beneficial.

We could of course consider such an "asking API" instead of the current load/save API, to better allow applications to scale to large numbers of HSTS entries.

For many applications (e.g. the curl command line), it seems likely that you only need the records for one or a few hosts. Does it make sense to load the entire cache for this?

What is "the entire cache" for you? I would imagine a typical command line use case has a very small cache of just a handful of hosts, so reading "the entire cache" is a very fast and small operation. Separate caches for separate invocations. With the existing functionality, two different curl command line invocations using the same file on disk will just make the second overwrite the first.

What influences the caching algorithm?

Standard coding practice: first we deliver something basic that works. If there's a need, we improve it over time. If not, basic is fine.

the cache load should happen when a cache miss happens, not in library init

That's a fair idea. I don't think the current API dictates when the cache gets read so we can work on tweaking that.

@bagder
Member Author

bagder commented Nov 2, 2020

I'll merge this shortly. Buckle up!

- enable in the build (configure)
- header parsing
- host name lookup
- unit tests for the above
- CI build
- CURL_VERSION_HSTS bit
- curl_version_info support
- curl -V output
- curl-config --features
- CURLOPT_HSTS_CTRL
- man page for CURLOPT_HSTS_CTRL
- curl --hsts (sets CURLOPT_HSTS_CTRL and works with --libcurl)
- man page for --hsts
- save cache to disk
- load cache from disk
- CURLOPT_HSTS
- man page for CURLOPT_HSTS
- added docs/HSTS.md
- fixed --version docs
- adjusted curl_easy_duphandle
- read/write callback options
- man pages for the 4 new setopts
- test 1915 verifies the callbacks
@bagder bagder closed this in 7385610 Nov 3, 2020
bagder added a commit that referenced this pull request Nov 3, 2020
- read/write callback options
- man pages for the 4 new setopts
- test 1915 verifies the callbacks

Closes #5896
@bagder bagder deleted the bagder/hsts branch November 3, 2020 15:09
@curl curl locked and limited conversation to collaborators Feb 7, 2021