DIHtmlParser recognizes 10 pieces of HTML plus 4 pieces of Non-HTML.
The HTML pieces are:
- CDATA: character data starts with <![CDATA[ and ends with ]]>.
- Comments: start with <!-- and end with -->.
- DTD: a document type definition starts with <!DOCTYPE and ends with >.
- Processing instructions: start with <? and end with >.
- HTML tags: like <TagName Attribute="Value" />.
- Script: the parser reports the content between <SCRIPT> and </SCRIPT> tags as simple text. The surrounding HTML tags are reported separately.
- Style: the parser reports the content between <STYLE> and </STYLE> tags as simple text. The surrounding HTML tags are reported separately.
- Text: if the NormalizeWhiteSpace option is enabled, DIHtmlParser reduces multiple white space to a single character. Preformatted text wrapped by <PRE> and </PRE> is never normalized.
- Title: the parser reports the text between <TITLE> and </TITLE> tags as simple text. Titles are not normal text because they are parsed differently.
- XML processing instructions: start with <?XML and end with ?>.
The Non-HTML pieces are:
- ASP: Active Server Pages code starts with <% and runs up to %>.
- Custom tags: start with # like in <#Name Attribute="Value" />.
- PHP: code starts with <?PHP and ends with ?>.
- SSI: a server-side include starts with <!--# and continues up to -->. It allows include files and other data to be inserted into HTML documents on the fly.
DIHtmlParser is extremely fast, especially when parsing huge files. Thanks to its internal buffer mechanism, it does not need to load the entire file into memory at once but reads it one small chunk at a time. DIHtmlParser parses up to 50 000 tags per second even on an outdated 166 MHz processor. On modern machines, throughput exceeds 15 MB of HTML data per second.
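The buffering idea can be illustrated with a minimal Python sketch. This is not DIHtmlParser code: the function name `count_tags` and the `chunk_size` parameter are hypothetical, and the tag detection is deliberately simplified. It only shows how parser state can be carried across chunk boundaries so that the whole document never has to sit in memory.

```python
import io

def count_tags(stream, chunk_size=4096):
    """Count '<...>' spans in a text stream, reading one chunk at a time.

    Simplified on purpose: every '<' ... '>' pair counts as one tag;
    comments, CDATA and '>' inside quoted attributes get no special care.
    """
    count = 0
    inside = False  # True between '<' and '>'; survives chunk boundaries
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty string means end of stream
            break
        for ch in chunk:
            if ch == "<":
                inside = True
            elif ch == ">" and inside:
                inside = False
                count += 1
    return count
```

Because `inside` lives outside the chunk loop, a tag that is split across two chunks is still counted exactly once.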
DIHtmlParser only parses what it needs to parse. Thanks to its filtering mechanism, the parser can skip all pieces of HTML which the application did not request. Even though the parser must eventually touch every character of an HTML document, it might only need to store a fraction of that data for further processing. We call this "Smart Parsing", as not storing unnecessary data is one of the greatest time savers.
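As a rough illustration of the principle (scan everything, store only what was requested), here is a small Python sketch. The piece kinds, the patterns, and the `parse` function are assumptions made for this example; they cover only four piece types and do not reflect how DIHtmlParser is implemented.

```python
import re

# Hypothetical, simplified piece patterns. Order matters: comments and
# CDATA must be tried before the generic tag pattern.
PIECE_PATTERN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<cdata><!\[CDATA\[.*?\]\]>)"
    r"|(?P<tag><[^>]*>)"
    r"|(?P<text>[^<]+)",
    re.DOTALL,
)

def parse(html, wanted):
    """Scan every piece, but keep only those whose kind is in `wanted`."""
    kept = []
    for match in PIECE_PATTERN.finditer(html):
        kind = match.lastgroup  # name of the alternative that matched
        if kind in wanted:      # unrequested pieces are scanned, not stored
            kept.append((kind, match.group()))
    return kept
```

Calling `parse(html, {"comment"})` still walks the entire input, but the returned list holds nothing except comments.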
Another trick of "Smart Parsing" is to convert relevant tag and attribute strings into ordinal number IDs. As a result, the parser never needs to compare lengthy strings but can get away with a single number comparison instead. This improves performance and reduces processor load. Your own code benefits from this technique, too, as tag and attribute IDs are part of the DIHtmlParser interface.
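The ID technique might look like the following in Python. The `TagId` enumeration and the `tag_id` function are invented for illustration and are not the identifiers DIHtmlParser exposes; the point is only that the string lookup happens once per tag, after which every check is an integer comparison.

```python
from enum import IntEnum

class TagId(IntEnum):
    """Hypothetical ordinal IDs for a handful of tags."""
    UNKNOWN = 0
    A = 1
    IMG = 2
    TABLE = 3

# Built once; maps upper-case tag names to their IDs.
_TAG_IDS = {m.name: m for m in TagId if m is not TagId.UNKNOWN}

def tag_id(name):
    """Translate a tag name to its ID with a single dictionary lookup."""
    return _TAG_IDS.get(name.upper(), TagId.UNKNOWN)
```

Application code can then write `if tag_id(name) == TagId.IMG:` instead of comparing strings at every decision point.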
Tag filtering extends the general filtering to individual tags. It lets the programmer instruct the parser to hold back all tags which are not relevant to the application. Why bother with <TABLE> tags if you are only interested in the images of an HTML document? Instead of having the application check each tag for an <IMG> tag, simply instruct the parser to report only <IMG> tags in the first place. This allows DIHtmlParser to optimize its parsing, and your application no longer has to worry about unwanted tags.
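The <IMG> scenario can be sketched with Python's standard `html.parser` module. The `FilteringParser` class and the `image_sources` helper are hypothetical names for this example. Note one difference from the description above: the standard-library parser still calls the handler for every tag, so the filter here sits in the handler, whereas DIHtmlParser is described as filtering inside the parser itself.

```python
from html.parser import HTMLParser

class FilteringParser(HTMLParser):
    """Reports only the start tags named in `wanted_tags`."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = {t.lower() for t in wanted_tags}
        self.reported = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag in self.wanted:
            self.reported.append((tag, dict(attrs)))

def image_sources(html):
    """Return the src attribute of every <IMG> tag in `html`."""
    parser = FilteringParser({"img"})
    parser.feed(html)
    parser.close()
    return [attrs.get("src") for _, attrs in parser.reported]
```

The calling code never sees the <TABLE>, <TR> or <TD> tags; it only receives what it asked for.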
DIHtmlParser Plugins are the next step to customized HTML parsing. A single instance of TDIHtmlParser can run any number of parsing processes in parallel to its main parsing process. Each plugin features its own flexible filtering mechanism, just like the main parser. The plugin architecture keeps overhead to a minimum, as each plugin informs the parser about its requirements ahead of the parsing. So even with many plugins in effect, DIHtmlParser will never parse more than what your application actually asks for.
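One way to picture this arrangement is a host that collects each plugin's requirements before parsing and dispatches in a single pass. The Python sketch below is an assumption about the general pattern, not the DIHtmlParser plugin API: `Plugin`, `Host`, `wanted` and `on_tag` are all hypothetical names, and "in parallel" is modelled as several consumers sharing one pass over the document.

```python
from html.parser import HTMLParser

class Plugin:
    """Hypothetical plugin: declares its wanted tags before parsing starts."""

    def __init__(self, wanted_tags):
        self.wanted = frozenset(wanted_tags)
        self.seen = []

    def on_tag(self, tag, attrs):
        self.seen.append(tag)

class Host(HTMLParser):
    """Parses once and forwards each tag only to interested plugins."""

    def __init__(self, plugins):
        super().__init__()
        self.plugins = list(plugins)
        # Union of all plugin requirements, computed once ahead of parsing.
        self.needed = frozenset().union(*(p.wanted for p in self.plugins))

    def handle_starttag(self, tag, attrs):
        if tag not in self.needed:  # nobody asked for this tag
            return
        for plugin in self.plugins:
            if tag in plugin.wanted:
                plugin.on_tag(tag, attrs)
```

With one plugin collecting links and another collecting images, the host handles only <A> and <IMG> tags and ignores everything else, no matter how many plugins are attached.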
More information on DIHtmlParser Plugins is available on this page.