curl-library

Re: Project hiper - High Performance libcurl

From: Jamie Lokier <jamie_at_shareable.org>
Date: Wed, 9 Nov 2005 20:07:48 +0000

Daniel Stenberg wrote:
> >The current multi interface of having the user wait for events is not an
> >efficient design when it comes to IOCP.
> >
> >IOCP uses threading to get full scalability on SMP systems.
>
> But does it really scale that well on the ordinary plain non-SMP system? Is
> your normal Windows box up to serving 10,000 threads? (if you have the RAM
> for it). And I would expect that using a thread for each connection will be
> more memory consuming than using a single thread for all transfers (due to
> stack and thread context overhead).

As I understand it, the most efficient models are based on events,
like your design, but with a bounded pool of threads to provide a
certain amount of concurrency. This is efficient even on a single
CPU, because it provides concurrency when there is blocking I/O,
e.g. due to swapping or disk I/O. It also has potential advantages in
managing average latency and fairness, when some event operations take
unusually long to be processed.

The idea is that you have the usual event callbacks and state
machines, but the callbacks for different events can run concurrently.

It's up to the thread/event scheduler to optimise the concurrency,
priorities etc. to strike the best balance. This means that on a
single CPU system, it would use only one or a few threads. On a
larger system, it would use more threads. And, if some threads are
tending to block on I/O or thread-specific swapping (e.g. looking
things up in an in-memory hash table, which is paged out), the number
of threads increases to provide greater throughput. Multiple CPUs,
blocking, and slow event handlers are the reason for having concurrency.

The event driven state machines are there to remove the overhead that
comes with large numbers of threads. Particularly, thread stack
contexts take a lot of memory, but there are many other overheads too.

Another significant advantage of single-threaded, event state machines
is that no data locking is needed, due to lack of concurrency.
Locking is not cheap.

To retain that advantage, it is best to encode rules which ensure some
groups of event handlers will not be run concurrently. In effect,
it's like saying that every event handler is associated with a lock,
and all the handlers in a group share the same lock - but with the
scheduler optimising away the actual locking in that case. This means
the event handlers in a group can access the same data, without
needing finer-grained locks.

For a library like Curl, a likely rule would be to put all the
handlers for a specific file descriptor, or for a specific request, in
a group. _Some_ locking is then needed because file descriptors
service multiple requests, and requests may access multiple file
descriptors. But the locking can be quite coarse-grained.

The above sounds quite complicated, and like something rarely done.
It is both.

A good approximation is done, very simply, by starting a small pool of
threads, and each thread _independently_ handling its own large set of
requests etc. You can do this with Curl today, and with Curl's
enhancements to use large numbers of file descriptors that we've
already talked about.
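
A minimal sketch of that approximation, written in Python rather than C
only to keep it short. The names shard, run_worker and run_pool are
hypothetical, and handle stands in for whatever drives a transfer to
completion. In a real program each worker would presumably own its own
libcurl multi handle and event loop, which this sketch does not model.

```python
import threading


def shard(requests, n_workers):
    """Split requests round-robin into n_workers independent lists."""
    return [requests[i::n_workers] for i in range(n_workers)]


def run_worker(my_requests, handle, my_results):
    # This thread is the only one touching my_requests and my_results,
    # so no locking is needed inside the loop.
    for req in my_requests:
        my_results.append(handle(req))


def run_pool(requests, handle, n_workers=4):
    """Start a small fixed pool; each thread handles only its own shard."""
    shards = shard(requests, n_workers)
    results = [[] for _ in shards]
    threads = [
        threading.Thread(target=run_worker, args=(s, handle, r))
        for s, r in zip(shards, results)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The point of the design is that nothing is shared between workers, so
the single-threaded "no locking" property is kept inside each thread.
The cost is that work never migrates: a thread stuck on a slow handler
cannot hand its other requests to an idle thread.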

> >It could certainly be done from one thread, but then you would be losing
> >many of the benefits. This is aimed at being high performance so libcurl
> >(imho) should definitely add the minimal threading awareness for IOCP, if
> >only for this hiper API.
> >
> >The API I gave was complete. You open a hiper handle and push your easy
> >handles into it. The hiper handle would open one or more threads in the
> >background and callback when something significant happens to an easy
> >handle. Simple as that. If you wanted to wait until all the easy handles
> >are finished, you can call the wait function.
>
> (I see a problem to merge a Windows-optimized concept with the currently
> planned event-optimized concept...)

I don't think it's a Windows-optimised concept, particularly. Maybe
Windows provides an implementation? The same concept is applicable to
any modern OS which has threads.

> If that is what you want (using a thread for each transfer), won't it
> suffice to simply start a new thread and fire off a separate
> curl_easy_perform() in there? In what way would this suggested interface be
> an improvement to that?

I'd be very surprised if a thread per request was intended. That
would indeed be very resource-intensive.

> HTTP pipelining might be hard to add nicely for such a use case.
>
> >If hiper is meant to abstract away all the stuff needed for high
> >performance http, it should also be in charge of threading efficiently as
> >needed.
>
> I want libcurl to abstract away all protocol and transfer related matters.
> I want it to know as little as possible about event systems and threading
> models.

I agree. That's a good plan.

I might have misunderstood the grandparent-post's proposals, adding
instead my own flavour. :-) (I am writing a library of the type
described above - I'm allowed to be biased!).

We've discussed before what kind of API is needed to wait on large
numbers of file descriptors - a "scalable" method. We only looked
at the case of a single thread.

Such applications can use multiple threads, simply by submitting
requests in different threads, and each thread runs independently.

That would work fine, even on Windows.

But I think, off the top of my head, that the essential API feature
we're now talking about is OS-supported methods of _automatically_
distributing work across threads - instead of requiring the Curl-using
application to distribute requests itself.

That's really an advanced feature that would rarely be used, I think.
However, for the biggest data-moving applications, I suspect it would
be a performance enhancement. It's very hard to know, without trying
it. The theoretical corner-case gains might be swamped, in all
practical scenarios, by the overheads of additional locking that come
with it.

I think such a fancy event/thread scheduler should not be part of Curl;
it should be a project in its own right. (One I've heard of for unix
is called libasync-smp. I'm slowly working on one myself, too).

But to support it, the question is whether Curl's API would be better
adapted so that it _could_ work with something like that - something
where events that are handled in one thread may migrate to other
threads.

I think that comes down to thinking about what locking and shared data
structures are used in Curl, and then specifying rules that say which
of the file descriptor event handlers can be called in different
threads than the requests which originated them, and what the
non-concurrency groups are.

libasync-smp's approach to that is for each event handler to have a
"colour", which is a single integer. Handlers with the same colour
will not be run concurrently, and all handlers are assigned a default
colour of zero - or alternatively a colour corresponding to the thread
that originated the request. The event handlers can then change their
own colours, e.g. to a colour corresponding to a file descriptor or
something, if they implement sufficient locking that running them on
non-home threads is safe.
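
To make the colour rule concrete, here is a rough sketch of a scheduler
that enforces it. This is not libasync-smp's API: ColourScheduler and
its methods are hypothetical names, and the per-colour queue is just
one assumed way to get "same colour never runs concurrently" on top of
a bounded thread pool. It is in Python for brevity.

```python
import threading
from collections import defaultdict, deque
from concurrent.futures import ThreadPoolExecutor


class ColourScheduler:
    """Run handlers on a bounded pool; same-colour handlers never overlap."""

    def __init__(self, n_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=n_workers)
        self._lock = threading.Lock()        # guards the two tables below
        self._pending = defaultdict(deque)   # colour -> queued handlers
        self._running = set()                # colours being drained now

    def submit(self, handler, colour=0):
        """Queue handler under colour (default colour is zero)."""
        with self._lock:
            self._pending[colour].append(handler)
            if colour in self._running:
                return  # the thread already draining this colour will run it
            self._running.add(colour)
        self._pool.submit(self._drain, colour)

    def _drain(self, colour):
        # At most one drain task exists per colour, so handlers of one
        # colour run strictly one after another, in submission order.
        while True:
            with self._lock:
                queue = self._pending[colour]
                if not queue:
                    self._running.discard(colour)
                    del self._pending[colour]
                    return
                handler = queue.popleft()
            # Run outside the lock so other colours can proceed in parallel.
            # Note: a handler that raises would leave its colour stuck;
            # real code would catch and report the exception here.
            handler()

    def shutdown(self):
        """Wait for every queued handler to finish."""
        self._pool.shutdown(wait=True)
```

Handlers sharing a colour can touch the same data without locks of
their own; the only lock is the scheduler's short internal one. A
handler "changing its colour" would amount to submitting its
continuation under a different colour value.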

There is also something called "gang scheduling" which I'll not
describe in detail. This is where event handlers indicate their
similarity of code/data cache usage, so they can be scheduled in a way
which groups similar uses in sequence.

This is all complex and great stuff for high performance web servers
that do lots of complicated processing.

But is it worth exploring this level of detail for Curl, or would
simply implementing scalable file descriptor events within each thread
be quite enough for realistic applications of Curl?

-- Jamie
Received on 2005-11-09