curl-and-python

RE: Pycurl and the curl_multi_socket_action API

From: Utsav Sabharwal <tashywashysachy_at_live.in>
Date: Sat, 7 Apr 2012 17:49:29 +0530

> Thank you Utsav,
> Are you sure that the pycurl multi does not accept additional easy handles while performing ?
>
> Actually, I managed to get my virtual users authenticate and start a web browsing session with random urls that are added to the multi in the main loop. It does work perfectly. I just have a problem with the m.select() that does not block enough leading to CPU over-consumption. I didn't try to diagnose it since I'm trying to avoid the select().
> The main loop is very similar to the one given in the first example that you gave.
>
> Threading is not a good alternative for me because it not only about crawling performance : The timing is also important to have a realistic simulation. The virtual users simulate real web traffic: They select a random page, download it, parse it to download all the related media at once (images, css, js, etc.,) and finally, they "sleep" for a defined duration before requesting another page.
>
> A thread pool with a task queue would delay requests. A thread (or more) per user would lead to too much concurrency and would give poor results !
>
> I really wanted to stick with python because it makes it easier to parse the html and because classes would make the code much more readable than linear C (object inheritance is perfect to implement different user behaviors). Maybe I'll use pycurl to download the html pages and send the related media (images, css, js, etc.,) to hiperfifo.c to take advantage of the strengths of both languages.
>
> Ideas are welcome :)
>
> Cheers,
>
> On Fri, Apr 6, 2012 at 2:36 PM, Daniel Stenberg <daniel_at_haxx.se> wrote:
>> On Fri, 6 Apr 2012, Utsav Sabharwal wrote:
>>
>>> In general multi curl is non blocking so it could have provided us same effects in a single thread if we keep adding urls even during the multi curl run but then in pycurl trying to add while multi curl is performing is not possible.
>>
>> That seems like a really stupid restriction you should work on fixing...
>>
>> --
>>  / daniel.haxx.se

> Are you sure that the pycurl multi does not accept additional easy handles while performing ?

Did you try adding a URL while m.perform() is performing? By performing I mean the time between when m.perform() is called and when ret and num_handles are returned. You can definitely add handles after ret and num_handles are returned -:)

while 1:
    ret, num_handles = m.perform()  # you can definitely add here
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        break
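
To make that concrete, here is a minimal sketch of an outer loop that keeps attaching new easy handles between perform() calls. It assumes pycurl is available; next_url() and want_more() are hypothetical placeholders for your own URL source and scheduling logic, not part of pycurl:

    import pycurl

    m = pycurl.CurlMulti()
    handles = []

    def add_url(url):
        # Create an easy handle, attach it to the multi, and remember it.
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, lambda data: len(data))  # discard the body
        m.add_handle(c)
        handles.append(c)

    add_url(next_url())  # next_url() is a hypothetical URL source

    while handles:
        # Drive the transfers that are currently registered.
        while 1:
            ret, num_handles = m.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break

        # Between perform() calls the multi is not "performing", so new
        # easy handles can be attached here without restarting anything.
        if want_more():  # want_more() is hypothetical scheduling logic
            add_url(next_url())

        # Reap finished transfers and detach their handles.
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list + [e[0] for e in err_list]:
            m.remove_handle(c)
            handles.remove(c)

        m.select(1.0)  # wait up to a second for socket activity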

> Threading is not a good alternative for me because it not only about crawling performance : The timing is also important to have a realistic simulation. The virtual users simulate real web traffic: They select a random page, download it, parse it to download all the related media at once (images, css, js, etc.,) and finally, they "sleep" for a defined duration before requesting another page.
>
> A thread pool with a task queue would delay requests. A thread (or more) per user would lead to too much concurrency and would give poor results !
You cannot achieve more speed than your maximum bandwidth allows at a given time, and you cannot reduce the time a server spends on handshaking etc. within the complete URL transfer. I guess you can only make other requests while a socket is waiting for a response. In other words, the best architecture for you is one where the only enforced wait is the per-user delay, and the socket work proceeds in real time whenever the network is available.
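
As a rough sketch of that idea (assuming a hypothetical user object with an idle flag, a next_request_at timestamp and a next_page() method; add_url() is the helper from the sketch above), the per-user "think time" can be enforced with timestamps checked on every pass of the multi loop, so waiting users never block the socket work:

    import time

    def schedule_ready_users(users, add_url):
        # Start a request only for users whose think time has elapsed;
        # everyone else stays idle until a later pass of the loop.
        now = time.time()
        for user in users:
            if user.idle and now >= user.next_request_at:
                add_url(user.next_page())
                user.idle = False

Calling this once per pass of the outer loop keeps each user's delay close to the intended value without ever putting the whole crawler to sleep.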
That being said, if the wait time for each request is too long, there is always a limit on the number of sockets that can be open in either approach (async I/O or threads). So if you have a large number of users, I am not sure you can give them completely parallel performance in the worst-case scenario where too many of them are waiting for an HTTP response.
Designing a high-performance crawler in your case requires studying many factors. Is your system limited to a fixed set of domains? Can you reuse a handle (every new handle has its own cost)? What operations do you want to perform on the downloaded pages? Do you have multiple systems, and what about a distributed architecture? How much does the processing cost, and how do you plan to do it in parallel without affecting crawler performance? And so on.
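On the handle-reuse point, here is a minimal sketch (the URLs are just placeholders) of keeping one pycurl.Curl object alive across requests, so libcurl can reuse its connection cache instead of paying the setup cost every time:

    import pycurl
    from io import BytesIO

    c = pycurl.Curl()
    for url in ("http://example.com/a", "http://example.com/b"):
        buf = BytesIO()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        c.perform()  # same handle, so the connection may be reused
        print(url, c.getinfo(pycurl.RESPONSE_CODE), len(buf.getvalue()))
    c.close()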
With a little tweaking I was able to crawl 45M - 50M URLs per day even with the threaded approach, and at times I have not been able to reach 1M URLs per day even using async I/O approaches. At the lower level, libcurl gives you the ultimate performance, that is what I can assure you :)
If you can share your crawler code, we might be able to suggest the most optimal solution for your situation -:)

d_r_a_G_o_s
_______________________________________________
http://cool.haxx.se/cgi-bin/mailman/listinfo/curl-and-python
Received on 2012-04-07