curl-and-php

Re: Curl instance against web browser

From: Daniel Stenberg <daniel_at_haxx.se>
Date: Mon, 2 Mar 2009 14:50:04 +0100 (CET)

On Mon, 2 Mar 2009, jom dalina wrote:

> However it happens that I am scraping a site that requires login, my problem
> is that in web browser, the site does not accept reload of browser, direct
> access of page thru url, using "BACK" button from browser and opening a page
> in new tab also get me an error session.
>
> Now here is the scenario, I can login using curl and accessed the page after
> i successfully logged in, but i can't navigate the succeeding page. it get
> me an error that either my session(website), or i accessed the page
> directly.

That sounds just like the "normal" quirks and details you need to overcome
when you tread the minefield of HTTP scripting. Or perhaps more specifically,
when you try to mimic "a user in front of a browser".

> Since the site only works for the browser where you open it, and using a new
> browser window or either a tab will result in an error, I just thought that
> every curl activity is independent instance to each other.

If the server can detect that you just opened a new tab, it is probably trying
hard to detect this, and you must then up your efforts a notch and try hard to
look exactly like the session you want to be.

This has very little to do with libcurl's ability to "do sessions" or similar,
but is "only" a matter of your scripts needing to behave exactly as the
browser behaves when it follows the links on the site. LiveHTTPHeaders is your
friend and ally in this combat.
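
As a rough illustration of "behaving as the browser behaves", here is a
minimal sketch in Python's standard library (not PHP, and not anything the
site in question is known to require). It assumes the server checks three
things a browser normally sends: the session cookies, a Referer header naming
the page you came from, and a browser-like User-Agent. The class name, the
User-Agent string, and the URLs are all made up for the example; take the
real values from your captured headers. In PHP/cURL the corresponding options
would be CURLOPT_COOKIEJAR/CURLOPT_COOKIEFILE, CURLOPT_REFERER and
CURLOPT_USERAGENT.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from typing import Mapping, Optional

# Hypothetical User-Agent; copy the real string from your captured headers.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


class BrowserLikeSession:
    """Keeps cookies across requests and sends the previous URL as Referer."""

    def __init__(self, user_agent: str = BROWSER_UA) -> None:
        # One cookie jar shared by every request, like a single browser window.
        self.cookies = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies)
        )
        self.user_agent = user_agent
        self.last_url: Optional[str] = None

    def build_request(
        self, url: str, form: Optional[Mapping[str, str]] = None
    ) -> urllib.request.Request:
        headers = {"User-Agent": self.user_agent}
        if self.last_url is not None:
            # Pretend we followed a link from the page we fetched last.
            headers["Referer"] = self.last_url
        # A form makes this a POST; without one it is a plain GET.
        data = None
        if form is not None:
            data = urllib.parse.urlencode(form).encode("ascii")
        return urllib.request.Request(url, data=data, headers=headers)

    def fetch(self, url: str, form: Optional[Mapping[str, str]] = None) -> bytes:
        request = self.build_request(url, form)
        with self.opener.open(request) as response:
            body = response.read()
            # Remember where we ended up (after redirects) for the next Referer.
            self.last_url = response.geturl()
        return body


# Usage sketch (hypothetical URLs and form fields, not executed here):
#   session = BrowserLikeSession()
#   session.fetch("http://example.invalid/login", {"user": "me", "pass": "x"})
#   page = session.fetch("http://example.invalid/members/page2")
```

If the site still refuses the second request, compare what the script sends
with what the browser sends, header by header; hidden form fields and extra
cookies set along the way are other common things to copy.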

This can range from trivial to downright hard. I've tried to sum up some
things to consider in this document:

         http://curl.haxx.se/docs/httpscripting.html

Good luck!

-- 
  / daniel.haxx.se
_______________________________________________
http://cool.haxx.se/cgi-bin/mailman/listinfo/curl-and-php
Received on 2009-03-02