curl-and-php

FOLLOWLOCATION as true doesn't seem to work

From: Earl Brown <eabrown_at_csumb.edu>
Date: Sun, 6 May 2012 23:24:18 -0700

PHP curl users,

I can't figure out why I can't get the HTML source code of:

http://corpusdelespanol.org/x2.asp

even though I can get the source code of:

http://corpusdelespanol.org/x.asp
http://corpusdelespanol.org/x1.asp
http://corpusdelespanol.org/x3.asp
http://corpusdelespanol.org/x4.asp

My PHP script starts at x.asp, gets a cookie, and then passes
sequentially through the pages. For some reason I can't get the
source code of x2.asp; instead, I get the source of blank.asp. When I
use a browser (whether Firefox, Chrome, or Opera) I can see the source
code of x2.asp correctly, but curl gives me the source code of
blank.asp instead, even though I have FOLLOWLOCATION set to true. I'm
at a loss. One idea: x2.asp has some style tag definitions above the
opening <html> tag, while the other pages don't; they start with the
opening <html> tag and then define styles. Could that throw off curl?
Are browsers able to correct this while curl is not? Here's my PHP
script:

start of script
_____________

<?php

if (empty($_GET['word'])) {
die("Need to specify the word as a GET parameter, i.e. ?word=&lt;word&gt;");
}

$ch = curl_init();

curl_setopt($ch,CURLOPT_FOLLOWLOCATION,true);
curl_setopt($ch,CURLOPT_HEADER,true);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch,CURLOPT_COOKIEJAR,'cookie.txt');
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.202 Safari/535.12011-10-16 20:21:13');

curl_setopt($ch,CURLOPT_URL,'http://corpusdelespanol.org/x.asp');

curl_exec($ch);

curl_setopt($ch,CURLOPT_URL,'http://corpusdelespanol.org/x1.asp');

curl_exec($ch);

curl_setopt($ch,CURLOPT_URL,'http://corpusdelespanol.org/x2.asp?chooser=seq&p='
. urlencode($_GET['word']) .
'&w2=&wl=4&wr=4&r1=&r2=&ipos1=-select-&B7=SEARCH&sec1=19&sec1=18&sec2=0&sortBy=freq&sortByDo2=alph&minfreq1=freq&freq1=4&freq2=4&numhits=100&kh=100&groupBy=words&whatshow=raw&saveList=no&corpus=cde&ownsearch=y&holder=&whatdo=seq&rand1=y&whatdo1=1&didRandom=n&minFreq=freq&s1=1&s2=2&s3=3&perc=mi');

$second_page = curl_exec($ch);

echo $second_page . "<br /><br /><br />";

curl_setopt($ch,CURLOPT_URL,'http://corpusdelespanol.org/x3.asp?r=23&w11='
. urlencode($_GET['word']));

$inpage = curl_exec($ch);

preg_match('|<span ID="w_section">SECTION</span>: ORAL&nbsp<b>\(([0-9,]*)\)</b></td>|Ui', $inpage, $matches);

echo str_replace(',', '', $matches[1]);

__________
end of script

Instead of giving me what I see as the source of x2.asp in a browser,
I get this header and then the source of blank.asp:

HTTP/1.1 302 Object moved
Date: Mon, 07 May 2012 05:49:52 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Location: blank.asp
Content-Length: 130
Content-Type: text/html
Cache-control: private

HTTP/1.1 200 OK
Date: Mon, 07 May 2012 05:49:52 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Content-Length: 755
Content-Type: text/html
Cache-control: private

I had assumed that setting FOLLOWLOCATION to true would fix this
problem, but it doesn't. Any ideas?
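
For what it's worth, the header dump above contains two response
blocks, a 302 whose Location is blank.asp followed by a 200, which
suggests that curl did follow the redirect. The open question would
then be why the server redirects x2.asp in the first place. Below is a
minimal sketch for listing the hops in a response fetched with
CURLOPT_HEADER and CURLOPT_FOLLOWLOCATION both enabled. The function
name parse_redirect_chain and the sample string are hypothetical and
not part of the script above; the sample only imitates the shape of
the dump.

```php
<?php
// Hypothetical helper (not part of the original script): list each hop
// in a raw response that was fetched with CURLOPT_HEADER enabled.
function parse_redirect_chain(string $raw): array
{
    // Normalise line endings so CRLF and LF input behave the same.
    $raw = str_replace("\r\n", "\n", $raw);
    $hops = [];

    // Every hop begins with a status line at the start of what is left;
    // the loop stops as soon as the remaining text is the body.
    while (preg_match('~^HTTP/\d+(?:\.\d+)?\s+(\d{3})[^\n]*\n~', $raw, $m)) {
        // A blank line ends the current header block.
        $end   = strpos($raw, "\n\n");
        $block = $end === false ? $raw : substr($raw, 0, $end);
        $raw   = $end === false ? '' : substr($raw, $end + 2);

        // Pick up the Location header of this hop, if there is one.
        $location = null;
        if (preg_match('~^Location:\s*(.+)$~mi', $block, $loc)) {
            $location = trim($loc[1]);
        }
        $hops[] = ['status' => (int) $m[1], 'location' => $location];
    }
    return $hops;
}

// Made-up sample shaped like the dump above (assumption, not real output).
$sample = "HTTP/1.1 302 Object moved\r\nLocation: blank.asp\r\n\r\n"
        . "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>";

foreach (parse_redirect_chain($sample) as $hop) {
    echo $hop['status'], ' ', $hop['location'] ?? '(final)', "\n";
}
```

Run against the real value of $second_page, a first hop with status
302 and location blank.asp would confirm that the redirect comes from
the server's response to the x2.asp request itself.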

Thank you for your help. Best, Earl Brown

--
Earl K. Brown, PhD
Chair, Language ULR Committee
Assistant Professor of Spanish Linguistics
School of World Languages and Cultures
California State University, Monterey Bay
_______________________________________________
http://cool.haxx.se/cgi-bin/mailman/listinfo/curl-and-php
Received on 2012-05-07