curl-and-python

RE: Pycurl and the curl_multi_socket_action API

From: Utsav Sabharwal <tashywashysachy_at_live.in>
Date: Fri, 6 Apr 2012 17:59:05 +0530

> On Fri, 6 Apr 2012, Utsav Sabharwal wrote:
>
>> But I am not sure if pycurl multi gives similar non blocking feature as in
>> libcurl. I prefer using curl easy in threaded mode. The following example
>> creates 100 parent threads and I was easily able to crawl around 5M urls
>> per day on Amazon EC2
>
> 100 threads is probably fine. I would say the problem becomes slightly
> different when you go up to 1000 or 10000 connections...
>
> --
>
>  / daniel.haxx.se

In the given implementation, I actually created 100 cyclic (infinite-loop) threads. Each thread pops a URL from a queue, crawls it the easy way, and pops again. This let me overcome the problem of waiting for HTTP responses and keep my network almost fully utilized.

In general, multi curl is non-blocking, so it could have given us the same effect in a single thread if we could keep adding URLs even during the multi curl run. But in pycurl, trying to add URLs while multi curl is performing is not possible. That means that until the first 10,000 have been crawled or have timed out, there is no place for new ones. Thus, with the pycurl multi curl implementation I was only able to crawl 100,000 URLs/day, while the former implementation reached 10 M URLs/day on the same machine.

I believe the same is not the case in C, though. Due to its non-blocking lower-level implementation, libcurl performs even better than the former implementation. Moreover, too many threads in Python cause the threads to block each other because of the GIL, and max possible threads = total virtual memory / (stack size * 1024*1024), which comes to 324 on an Amazon EC2 small instance. Thus, 100 parents having 100 children, totaling 200 threads, sounded like the best approach in my environment.
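For reference, here is a minimal sketch of the pattern described above: a pool of threads that each pop a URL from a shared queue, fetch it with a blocking easy handle, and pop again. It is not the original crawler. It assumes Python 3, and the names `fetch_with_pycurl`, `worker` and `crawl`, the 30 second timeout, and the `None` stop sentinel are illustrative choices (the threads in the post looped forever). The fetch function is passed in so the queue logic can be tried without pycurl or network access.

```python
import queue
import threading


def fetch_with_pycurl(url):
    """Fetch one URL with a blocking pycurl easy handle (needs pycurl)."""
    import pycurl  # imported lazily so the rest of the sketch runs without it
    from io import BytesIO

    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.setopt(pycurl.NOSIGNAL, 1)   # avoid signal-based timeouts in threads
    c.setopt(pycurl.TIMEOUT, 30)   # assumed value, not taken from the post
    try:
        c.perform()
    finally:
        c.close()
    return buf.getvalue()


def worker(url_queue, results, fetch):
    """Pop a URL, crawl it, pop again; stop when a None sentinel arrives."""
    while True:
        url = url_queue.get()
        if url is None:
            url_queue.task_done()
            break
        try:
            results.put((url, fetch(url), None))
        except Exception as exc:  # record the failure and keep the thread alive
            results.put((url, None, exc))
        finally:
            url_queue.task_done()


def crawl(urls, fetch=fetch_with_pycurl, num_threads=100):
    """Crawl urls with num_threads workers; return (url, body, error) tuples."""
    url_queue = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(url_queue, results, fetch), daemon=True)
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for url in urls:
        url_queue.put(url)
    for _ in threads:
        url_queue.put(None)  # one stop sentinel per worker
    for t in threads:
        t.join()
    out = []
    while not results.empty():
        out.append(results.get())
    return out
```

Calling `crawl(list_of_urls)` uses pycurl; passing `fetch=some_function` swaps in any other blocking fetcher. Because each thread spends most of its time waiting on the network, the GIL is released during the transfer, which is why a modest number of threads can keep the link busy.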

d_r_a_G_o_s 
_______________________________________________
http://cool.haxx.se/cgi-bin/mailman/listinfo/curl-and-python
Received on 2012-04-06