curl-and-python

PyCurl as a high-performance client

From: <scott_py_at_itsanadventure.com>
Date: Thu, 11 Dec 2008 11:52:00 -0500 (EST)

Over the course of the last 7 years I developed a program in Perl which
retrieves web pages as fast as possible. The program generates traffic to
test the devices sitting between the "client" PC running my Perl program
and a "server" PC running Apache. Seven years of hacking Perl resulted in a
bunch of messy code, so I decided to learn Python by re-writing this program
in Python.

I've been going through some testing phases, writing some of the important
pieces in Python - things like reading a configuration file, collecting
statistics and generating reports, and several other tasks. It's been a
good way to learn Python.

Right now I'm testing one of the most important pieces - retrieving files
from the HTTP server. After a bit of digging, it seems that PyCurl is the
right tool for this, so I've been messing with some of the example PyCurl
scripts, like retriever-multi.py.

retriever-multi.py has what seems to be a great feature: the ability to
handle concurrent connections. In my Perl script I achieve this by
forking a bunch of processes, and each process goes off and starts
retrieving files from the server. I had originally planned on using
threading with Python 2.6 to do this, but PyCurl's ability to do
concurrent sessions may eliminate the need to consider threading.
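For reference, the heart of that pattern looks roughly like this (a
simplified sketch of what retriever-multi.py does, not my actual script;
the URL and connection count are placeholders):

    # Drive 20 concurrent transfers from one thread with CurlMulti.
    import pycurl

    num_conn = 20
    url = "http://server/100KB.bin"        # placeholder test file

    m = pycurl.CurlMulti()
    handles = []
    for i in range(num_conn):
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        # Discard the response body; returning None tells pycurl the
        # data was consumed.
        c.setopt(pycurl.WRITEFUNCTION, lambda data: None)
        m.add_handle(c)
        handles.append(c)

    # Standard multi-interface loop: perform until no more work is
    # immediately available, then wait for socket activity and repeat.
    while True:
        ret, num_active = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    while num_active:
        m.select(1.0)
        while True:
            ret, num_active = m.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break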

Now the problem...

As an example: with my Perl script, I have 20 processes each retrieving a
100KB file over and over again until I tell them to stop. That generates
1 gigabit per second of traffic (the NIC's limit) and still has CPU left over.

A slightly-modified retriever-multi.py set to 20 concurrent connections
retrieving the same 100KB file generates slightly over 5 MEGABITS per
second, and the CPU is maxed out. That is about 1/200th of what my old
Perl script can do - probably even less. (The modifications made to
retriever-multi.py were to remove any writing to disk or screen. The
retrieved file is simply stored in a memory buffer which is flushed before
every connection.)
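In case it helps, the buffer handling looks roughly like this (a simplified
sketch, assuming Python 2.6 and cStringIO; the function names are mine, not
from retriever-multi.py):

    # Each handle writes into an in-memory StringIO buffer instead of a
    # file; the buffer is thrown away and recreated before the handle is
    # reused for the next request.
    import pycurl
    from cStringIO import StringIO

    def make_handle(url):
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.buf = StringIO()                       # per-handle memory buffer
        c.setopt(pycurl.WRITEFUNCTION, c.buf.write)
        return c

    def recycle(c, url):
        # Flush the old buffer and re-arm the handle before it is
        # re-added to the CurlMulti object for the next transfer.
        c.buf = StringIO()
        c.setopt(pycurl.WRITEFUNCTION, c.buf.write)
        c.setopt(pycurl.URL, url)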

The questions...

Is PyCurl the right tool for this job?

Are PyCurl's concurrent sessions the right way to do this, or would it work
better with multiple threads, each having a single session?
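To make the second question concrete, the threaded alternative I have in
mind looks roughly like this (just a sketch; the URL, thread count and
request count are placeholders):

    # One thread per connection, each thread reusing a single Curl handle.
    import threading
    import pycurl
    from cStringIO import StringIO

    def worker(url, count):
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        for _ in xrange(count):
            buf = StringIO()                     # fresh buffer per request
            c.setopt(pycurl.WRITEFUNCTION, buf.write)
            c.perform()                          # blocking easy-interface call
        c.close()

    threads = [threading.Thread(target=worker,
                                args=("http://server/100KB.bin", 1000))
               for _ in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

As far as I understand, PyCurl releases the GIL while perform() runs, so the
threads shouldn't all serialize on the interpreter - but I don't know whether
that helps with the CPU problem I'm seeing.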

If you want more info, let me know, but the main idea of the program is
to retrieve MANY files as fast as possible, over and over again.
_______________________________________________
http://cool.haxx.se/cgi-bin/mailman/listinfo/curl-and-python
Received on 2008-12-11