
curl-library

Re: what is the best way to extract urls from web page with?

From: Yaroslav Samchuk <yarcat_at_ank-sia.com>
Date: Tue, 6 Nov 2007 13:03:51 +0200

On 06.11.2007, at 12:27, <hallouina-ml_at_yahoo.fr> wrote:

> Hello everybody,
>
> Please, I would like to know how to extract URLs from a web page with
> C++. I first thought of using a regex, but that way maybe I can only
> extract the first URL? My other idea was to handle the web page as an
> HTML tree, like TreeBuilder in Perl, but a friend says that doing it
> that way is slow. Otherwise I don't know what kind of tool to use to
> extract URLs from my page as a tree.
>
> What is the best way? If it is to handle the page as a tree, what kind
> of simple library could I use, please?

I think any regular expression engine provides search functionality;
you can use boost::regex_search, for example. Here is an example:
http://www.boost.org/libs/regex/example/snippets/regex_search_example.cpp
or check this page (it contains a reference to the examples page):
http://www.boost.org/libs/regex/doc/index.html
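
For instance, a minimal sketch (untested, and with a deliberately
simplified href pattern that I made up for illustration) of calling
boost::regex_search in a loop so that every URL is found, not just the
first one:

// Collect every quoted href value from an HTML string by repeatedly
// applying boost::regex_search and restarting after each match.
#include <boost/regex.hpp>
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> extract_urls(const std::string& html)
{
    std::vector<std::string> urls;
    // Simplified pattern: capture the quoted value of an href attribute.
    boost::regex href_re("href\\s*=\\s*\"([^\"]+)\"", boost::regex::icase);

    std::string::const_iterator start = html.begin();
    std::string::const_iterator end = html.end();
    boost::smatch what;

    // Each successful search advances past the previous match,
    // so the loop finds every URL, not only the first.
    while (boost::regex_search(start, end, what, href_re)) {
        urls.push_back(what[1]);
        start = what[0].second;
    }
    return urls;
}

int main()
{
    std::string page = "<a href=\"http://curl.haxx.se/\">curl</a> "
                       "<a href=\"http://www.boost.org/\">boost</a>";
    std::vector<std::string> urls = extract_urls(page);
    for (size_t i = 0; i < urls.size(); ++i)
        std::cout << urls[i] << '\n';
    return 0;
}

The point is simply that each iteration restarts the search right after
the previous match, which is what lets a single regex pull out all of
the URLs instead of stopping at the first one.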

I think there's no need to build a tree unless you really need it.

Oh! If you're using C, then try PCRE and look at the
pcre_get_substring_list function.
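
A corresponding rough sketch with the classic PCRE C API (usable from C
or C++, again with the same assumed, simplified pattern), looping
pcre_exec() and pulling the captured value out with
pcre_get_substring_list():

/* Find every quoted href value by calling pcre_exec in a loop and
   reading the captured substrings with pcre_get_substring_list. */
#include <pcre.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *pattern = "href\\s*=\\s*\"([^\"]+)\"";
    const char *subject = "<a href=\"http://curl.haxx.se/\">curl</a>";
    const char *error;
    int erroffset;

    pcre *re = pcre_compile(pattern, PCRE_CASELESS, &error, &erroffset, NULL);
    if (re == NULL) {
        fprintf(stderr, "compile failed at %d: %s\n", erroffset, error);
        return 1;
    }

    int ovector[30];
    int offset = 0;
    int len = (int)strlen(subject);
    int rc;

    /* Keep matching from the end of the previous match so that every
       URL in the subject is found, not only the first. */
    while ((rc = pcre_exec(re, NULL, subject, len, offset, 0,
                           ovector, 30)) > 0) {
        const char **list;
        if (pcre_get_substring_list(subject, ovector, rc, &list) == 0) {
            printf("%s\n", list[1]);   /* list[0] is the whole match */
            pcre_free_substring_list(list);
        }
        offset = ovector[1];
    }

    pcre_free(re);
    return 0;
}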

--
Regards,
Yaroslav Samchuk