DIHtmlParser: Plugins
Overview
DIHtmlParser plugins link themselves into the main parsing process of TDIHtmlParser. As the parsing proceeds, they receive the same information about the parsing process as the main parser. The plugins can then react to that information and perform specific tasks which are not built into TDIHtmlParser by default.
Plugins are a flexible way to extend the functionality of TDIHtmlParser. They can act on HTML data in completely new ways unknown to TDIHtmlParser at the time of its writing. This allows the parser to stay small and concentrate on what it does best: fast and reliable HTML parsing. Each plugin, on the other hand, can add its own specialized functionality to the core HTML parser as required.
A Plugin Scenario
Think of how to extract the title text of an HTML document. You would probably first want to locate the <TITLE> start tag, then collect all text up to the </TITLE> end tag. Quite simple by itself, no? But if you have a complicated parsing process already underway, it is nice to know that DIHtmlParser allows you to keep different things separate.
The TDIHtmlTablesPlugin plugin locates and extracts an HTML document's title in parallel to the main parsing process without you having to do anything. When the plugin has found the title, it stores the title text and triggers an application callback. Alternatively, your application can request the title from the plugin when the parsing is done. It is as simple as that.
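The scenario above can be illustrated with a small, language-neutral sketch. It is written in Python with the standard library's html.parser and is not DIHtmlParser's API: the names PluggableParser, TitlePlugin, start_tag, end_tag, text and on_title are hypothetical and exist only for this illustration. The host parser forwards each parsing event to its plugins, and the title plugin collects the text between the title tags, stores it, and fires a callback.

```python
from html.parser import HTMLParser


class TitlePlugin:
    """Hypothetical plugin: collects the text between <title> and </title>."""

    def __init__(self, on_title=None):
        self.title = None           # available to the application after parsing
        self._on_title = on_title   # optional application callback
        self._parts = None          # None while outside the title element

    def start_tag(self, tag):
        if tag == "title" and self.title is None:
            self._parts = []

    def text(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def end_tag(self, tag):
        if tag == "title" and self._parts is not None:
            self.title = "".join(self._parts).strip()
            self._parts = None
            if self._on_title:
                self._on_title(self.title)


class PluggableParser(HTMLParser):
    """Hypothetical host parser: forwards every event to all plugins."""

    def __init__(self, plugins):
        super().__init__()
        self.plugins = list(plugins)

    def handle_starttag(self, tag, attrs):
        for plugin in self.plugins:
            plugin.start_tag(tag)

    def handle_endtag(self, tag):
        for plugin in self.plugins:
            plugin.end_tag(tag)

    def handle_data(self, data):
        for plugin in self.plugins:
            plugin.text(data)


found = []                                   # filled by the callback
plugin = TitlePlugin(on_title=found.append)
parser = PluggableParser([plugin])
parser.feed("<html><head><TITLE>Plugin Demo</TITLE></head>"
            "<body><p>Hi</p></body></html>")
parser.close()
print(plugin.title)
```

The main parsing code never mentions titles. The application can either react to the callback or read plugin.title once parsing is done, which mirrors the two options described above.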
Ready-Made Plugins
DIHtmlParser ships with a number of ready-to-use plugins.
Case Plugin
The TDIHtmlCasePlugin changes tag and attribute names to upper case or lower case. It was requested by a user who wanted to create uniformly formatted HTML and has since proven useful to many others.
Character Set Plugin
The TDIHtmlCharSetPlugin watches for character set information in HTML documents and automatically updates the character decoding of the HTML parser. This is useful if the character set is unknown prior to parsing or changes in the middle of a document.
E-Mails Plugin
The TDIHtmlEmailsPlugin scans an HTML document for links to e-mail addresses. For each hit it can trigger an application event and/or add the address to an internal list for later retrieval. This plugin should not be abused as an e-mail harvester.
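As a rough illustration of the idea, the following Python sketch collects addresses from mailto: links, both through a callback and in an internal list. It uses only the standard library and does not reflect TDIHtmlEmailsPlugin's actual interface. The class name EmailLinkCollector and the on_email parameter are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class EmailLinkCollector(HTMLParser):
    """Hypothetical collector for addresses found in mailto: links."""

    def __init__(self, on_email=None):
        super().__init__()
        self.emails = []            # internal list for later retrieval
        self._on_email = on_email   # optional per-hit application callback

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parts = urlsplit(href.strip())
        if parts.scheme != "mailto":    # urlsplit lower-cases the scheme
            return
        # A mailto: link may name several comma-separated recipients;
        # the query part (e.g. ?subject=...) is already split off.
        for address in unquote(parts.path).split(","):
            address = address.strip()
            if address and address not in self.emails:
                self.emails.append(address)
                if self._on_email:
                    self._on_email(address)


hits = []
collector = EmailLinkCollector(on_email=hits.append)
collector.feed(
    '<a href="mailto:info@example.com?subject=Hi">Mail</a> '
    '<a href="https://example.com/">Web</a> '
    '<a href="MAILTO:sales@example.com">Sales</a> '
    '<a href="mailto:info@example.com">Again</a>'
)
collector.close()
print(collector.emails)
```

Duplicates are skipped here as a design choice of the sketch. Whether the real plugin does the same is not stated in this document.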
Events Plugin
The TDIHtmlEventsPlugin triggers events for each HTML piece. This turns DIHtmlParser into something like a SAX parser for HTML. TDIHtmlEventsPlugin supports tag filtering (as all plugins do), which SAX parsers do not!
Links Plugin
The Links plugin collects all links contained in an HTML document. It is fully customizable and can also trigger an event for each new link.
Table Plugin
The Table plugin keeps track of HTML tables encountered during parsing. Other parsing processes can query the Table plugin about the current table cell and column and the table nesting level.
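To make the idea of table tracking concrete, here is a minimal Python sketch that maintains one row/column counter per open table, so that nested tables are handled with a stack. It is only an analogy built on the standard library's html.parser. The TableTracker class, its depth property and its cells list are hypothetical and are not part of DIHtmlParser.

```python
from html.parser import HTMLParser


class TableTracker(HTMLParser):
    """Hypothetical tracker for table position and nesting."""

    def __init__(self):
        super().__init__()
        self._stack = []   # one [row, column] pair per currently open table
        self.cells = []    # (nesting depth, row, column, text) per text piece

    @property
    def depth(self):
        """Current table nesting level; 0 means outside any table."""
        return len(self._stack)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([0, 0])
        elif tag == "tr" and self._stack:
            self._stack[-1][0] += 1     # next row
            self._stack[-1][1] = 0      # column counter restarts
        elif tag in ("td", "th") and self._stack:
            self._stack[-1][1] += 1     # next column

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self._stack.pop()           # back to the enclosing table, if any

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            row, column = self._stack[-1]
            self.cells.append((self.depth, row, column, text))


tracker = TableTracker()
tracker.feed(
    "<table><tr><td>A</td><td>B"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr><tr><td>C</td></tr></table>"
)
tracker.close()
for cell in tracker.cells:
    print(cell)
```

Rows and columns are counted from 1 within each table. After the inner table closes, the counters of the outer table are restored, so "C" is reported in row 2, column 1 at nesting depth 1.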
Writer Plugin
The Writer plugin automates the writing of HTML data to another HTML document. It writes over 70 different character sets and encodings (144 with DIConverters) and automatically substitutes HTML entities when necessary.