curl-library

performance at scale

From: Nick Gerner <nick_at_seomoz.org>
Date: Fri, 2 Oct 2009 11:19:14 -0700

I'm curious whether anyone has tips on libcurl performance at scale. I
have some pretty good crawling code that I'm always trying to tune. I'm
running curl_multi with poll and between 500 and 1000 curl handles.
Most recently I grabbed an oprofile snapshot while one core was pegged (the
actual crawling code runs in a single thread) and found something
interesting:
  samples  %        image
  2528322  45.0216  libcurl.so.4.1.1 (called from two apps, but mostly the crawling app below)
  1128251  20.0907  libc-2.7.so
   760063  13.5344  no-vmlinux
   635400  11.3145  MY_PARSER_APP_HERE (runs in a separate process)
   161909   2.8831  MY_CRAWLING_DRIVING_APP_HERE (runs in the same process/thread as libcurl listed above)
   106982   1.9050  libz.so.1.2.3.3
    98056   1.7461  liblzo2.so.2.0.0 # the input for my crawl is lzo compressed
    51165   0.9111  libcares.so.2.0.0
    48928   0.8713  pdns_recursor # I'm using pdns recursor locally to do DNS

And more interestingly:

977123 38.6471 url.c:0 ConnectionExists
477781 18.8972 (no location information) Curl_raw_equal
344057 13.6081 hostip.c:0 hostcache_timestamp_remove
230962 9.1350 rawstr.c:0 my_toupper
184067 7.2802 (no location information) Curl_hash_clean_with_criterium
67392 2.6655 (no location information) curl_multi_remove_handle
65826 2.6035 (no location information) Curl_hash_pick
35846 1.4178 (no location information) Curl_hash_add
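
To put the two listings together: multiplying libcurl's share of all samples by ConnectionExists' share of libcurl's samples gives ConnectionExists' share of the whole profile. The sketch below does that arithmetic with the figures quoted above; the function names are made up for illustration. Since the libcurl row covers two apps, the result is only an estimate for the crawler itself.

```c
#include <stdio.h>

/* Share of all profiled samples attributable to one libcurl symbol,
 * given libcurl's share of the total and the symbol's share of libcurl.
 * Both inputs and the result are percentages. */
static double share_of_total(double lib_pct, double symbol_pct_of_lib)
{
    return lib_pct * symbol_pct_of_lib / 100.0;
}

/* Print the figure derived from the two oprofile listings above. */
static void print_profile_shares(void)
{
    const double libcurl_pct = 45.0216;           /* libcurl.so.4.1.1 row */
    const double connection_exists_pct = 38.6471; /* url.c ConnectionExists row */

    printf("ConnectionExists: %.1f%% of all samples\n",
           share_of_total(libcurl_pct, connection_exists_pct));
}
```

Calling print_profile_shares() reports roughly 17% of all samples, i.e. about one sample in six across the whole machine lands in ConnectionExists.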

That ConnectionExists call seems to take a lot of time! Looking at the
code, it looks like ConnectionExists should not get called if I set
curl_easy_setopt(curl[i]->curl, CURLOPT_FRESH_CONNECT, (long)1);

So I did that and got much better performance. But I still see essentially
the same oprofile report (about 40% of my CPU time is in libcurl, and about
40% of libcurl's time is spent in ConnectionExists). So... any thoughts on:

1) why does ConnectionExists take so long? (I'm guessing it does an expensive
traversal of a really big list of maybe 4k cached connections)
2) why am I still seeing all this time spent in ConnectionExists, even with
CURLOPT_FRESH_CONNECT set?
3) any other general perf tips? (e.g. other curl_easy_setopt or
curl_multi_setopt settings, or maybe compile-time options)

Some useful info:
$ curl-config --version
libcurl 7.19.3

I know, I should upgrade, but we had some stability issues with a slightly
newer version than this and rolling back fixed it.

$ curl-config --features
SSL
IPv6
libz
AsynchDNS
NTLM

Thanks a million!

--Nick

Received on 2009-10-02