curl-library

Re: [OT] Simple HTML parser needed

From: Jamie Lokier <jamie_at_shareable.org>
Date: Mon, 31 Jan 2005 14:12:58 +0000

Daniel Haude wrote:
> I know this is off-topic in this list, but since I believe that many
> people use libcurl to retrieve HTML data from which they seek to extract
> information I suspect that there will be some knowledge here on how to
> parse HTML.
>
> I'm not looking for a full-fledged DOM parser, just something that
> produces a "flat" stream of tags with attributes and normal text.
>
> I can roll my own but I'd like to know if there's some "industry
> standard" thing for this. I know plenty of XML parsers, but none of them
> seems to like digesting the typical broken HTML found on many web pages.

libxml2 has an adequate HTML parser. It gives you a stream of tags or
a DOM tree, whichever you prefer, just the same as if it were parsing
XML.
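
To give a rough idea of what a "flat" stream of tags, attributes and text
looks like, here is a minimal sketch. Note that it is not libxml2: it uses
Python's standard-library html.parser as a stand-in, because that runs
without extra dependencies. The class name FlatStream, the helper
flat_events and the sample markup are made up for illustration. libxml2's
HTML SAX interface delivers comparable start-tag, end-tag and text
callbacks, though it will not necessarily report broken markup the same way.

```python
from html.parser import HTMLParser


class FlatStream(HTMLParser):
    """Collect a flat list of (kind, name_or_text, attrs) events.

    Hypothetical helper for illustration; not part of libxml2 or libcurl.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, {}))

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        text = data.strip()
        if text:
            self.events.append(("text", text, {}))


def flat_events(html):
    """Parse an HTML string and return the flat event list."""
    parser = FlatStream()
    parser.feed(html)
    parser.close()
    return parser.events


# Deliberately sloppy markup: unquoted attribute, unclosed <p> and <b>.
sample = "<p class=intro>Hello <b>world<p>Second <a href='x.html'>link</a>"

for event in flat_events(sample):
    print(event)
```

This parser reports only the tags that are literally present, so the
unclosed <p> and <b> produce no end events. A parser that repairs the
markup may synthesize them, which is one place where parsers disagree.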

Like all HTML parsers, it parses slightly differently from all the
others, because there is no industry-standard grammar for real-world
HTML. But it is quite good enough for most inputs.

-- Jamie
Received on 2005-01-31