curl-and-php

Partial html output problem

From: James Hood <ebenblues_at_yahoo.com>
Date: Fri, 19 Oct 2007 00:45:59 -0700 (PDT)

Hi,

I'm having a strange issue using curl with PHP and hope someone on this list can help or point me in the right direction. I'm writing a PHP script that posts to a web form, captures the resulting HTML, and parses it with the DOM/SimpleXML functions. The returned HTML page is very long. If I parse the HTML output directly in the script, only the first few hundred bytes end up in the parsed data. But the odd thing is, if I write the HTML out to a file and then, in a separate script, read it back in and parse it using the same method, I get the entire HTML document in the parsed data. Because of this test, I believe it's a curl issue. Here's the basic script:

$c = curl_init($site->post_url);

curl_setopt($c, CURLOPT_POST, true);
curl_setopt($c, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);

$html = curl_exec($c);
curl_close($c);

// Option 1: save output html to a file
// $handle = fopen("out.html", "w");
// fwrite($handle, $html);
// fclose($handle);
// die();

// Option 2: parse output html directly in the script
error_reporting(E_ERROR); // Turn off warnings before DOM parsing
$dom = new DOMDocument();
$dom->loadHTML($html);
$items = $dom->getElementsByTagName("html");
$xml = simplexml_import_dom($items->item(0));
echo "<pre>\n";
print_r($xml);
echo "</pre>\n";
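
One way to separate a transfer problem from a parsing problem is to count the bytes curl actually handed back before any parsing happens. The following is only a minimal sketch: check_body_length() is a hypothetical helper, not part of the script above, and the sample string stands in for the $html returned by curl_exec().

```php
<?php
// Hypothetical helper (an assumption for illustration): compare the number of
// bytes returned by curl_exec() with the length the server announced.
function check_body_length($body, $expected_length)
{
    if ($body === false) {
        // curl_exec() returns false when the transfer fails
        return array('ok' => false, 'received' => 0, 'expected' => $expected_length);
    }
    $received = strlen($body); // byte count of the returned body
    return array(
        'ok'       => ($received === $expected_length),
        'received' => $received,
        'expected' => $expected_length,
    );
}

// Stand-in for the $html string produced by the script above
$sample = str_repeat('x', 2048);
$result = check_body_length($sample, 2048);

if ($result['ok']) {
    echo "complete\n";
} else {
    echo "short by " . ($result['expected'] - $result['received']) . " bytes\n";
}
```

In the real script the expected value could be the Content-Length shown in the verbose output, or (int) curl_getinfo($c, CURLINFO_CONTENT_LENGTH_DOWNLOAD) read before curl_close($c). If the counts match, the whole document reached PHP and the truncation happens later, in the parsing step.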

When I do option 1, the output file contains the entire HTML output, as I would expect. But when I do option 2, the SimpleXML object contains only the first couple hundred bytes of the HTML output. I suspected a problem with the DOM parsing/SimpleXML calls, so I made a separate script that reads in the HTML written to disk by this script and parses it in exactly the same way. With that script, the SimpleXML object contained the complete parsed HTML document.
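
Since option 2 turns warnings off before parsing, any complaint from the HTML parser is hidden. A possible way to see what the parser reports is to collect libxml's errors instead of suppressing them. This is a sketch under assumptions: load_html_with_errors() is a hypothetical helper, and the short sample string stands in for the real $html.

```php
<?php
// Hypothetical helper (an assumption for illustration): parse HTML and return
// the DOMDocument together with the messages libxml collected while parsing.
function load_html_with_errors($html)
{
    // Collect parser errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    return array($dom, $errors);
}

// Stand-in for the $html string produced by the script above
$sample = "<html><body><p>one</p><p>two</p></body></html>";
list($dom, $errors) = load_html_with_errors($sample);

// Each entry is a LibXMLError with the position where the parser complained
foreach ($errors as $e) {
    printf("line %d, column %d: %s\n", $e->line, $e->column, trim($e->message));
}

// Compare how much went in with how much the parser kept
printf(
    "input: %d bytes, serialized DOM: %d bytes, <p> elements: %d\n",
    strlen($sample),
    strlen($dom->saveHTML()),
    $dom->getElementsByTagName('p')->length
);
```

Running the same helper on the string from curl_exec() and on the string read back from out.html would show whether the two inputs really parse differently, and if so, at which line and column the parser first reports a problem.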

I've tried turning return transfer off and using PHP output buffering (ob_start and ob_get_contents), but I get the same result. Is this a curl bug or am I doing something wrong here? Any help would be greatly appreciated.

I'm running PHP version 5.2.3. Here's my libcurl info from phpinfo():

 libcurl/7.13.2 OpenSSL/0.9.7e zlib/1.2.2 libidn/0.5.13

Also, here's the output if I set curl to verbose mode:

* About to connect() to tarmls.rapmls.com port 80
* Trying 207.138.156.54... * connected
* Connected to tarmls.rapmls.com (207.138.156.54) port 80
> POST /scripts/mgrqispi.dll HTTP/1.1
Host: tarmls.rapmls.com
Pragma: no-cache
Accept: */*
Content-Length: 4604
Content-Type: application/x-www-form-urlencoded
Expect: 100-continue

< HTTP/1.1 100 Continue
< HTTP/1.1 200 OK
< Date: Fri, 19 Oct 2007 07:44:14 GMT
< Server: Microsoft-IIS/6.0
< X-Powered-By: ASP.NET
< Content-Type:text/html
< Content-Length: 1958671
* Connection #0 to host tarmls.rapmls.com left intact
* Closing connection #0

Thanks,
James
  
------------------------------
"The humble learn the fastest because they don't waste time on defending a false image."

Received on 2007-10-19