DIHtmlParser: Version History
DIHtmlParser 8.3.0 – 22 Nov 2023
- Support Delphi 12 Athens Win32 and Win64.
- New TDIHtmlTablesPlugin.CurrentTable property.
- New TDIHtmlTable.TableNum property.
- Fix buffer underrun in DIUri_3986 when cleaning up leading /../ and /.. dot segments on an empty base.
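
For readers unfamiliar with dot-segment cleanup, the following is a minimal standalone sketch of the remove_dot_segments procedure from RFC 3986, section 5.2.4. It is not the DIUri_3986 source code; the names RemoveDotSegments, RemoveLastSegment, and StartsWith are hypothetical helpers written for this illustration. It shows the case named in the fix: a leading /../ or /.. arrives while the output buffer is still empty, so removing the "last segment" must not read before the start of the buffer.

```pascal
program RemoveDotSegmentsDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Hypothetical helper: True if S begins with Prefix.
function StartsWith(const Prefix, S: string): Boolean;
begin
  Result := Copy(S, 1, Length(Prefix)) = Prefix;
end;

// Removes the last segment and its preceding '/' (if any) from Buf.
// The (I > 0) guard keeps the scan inside the buffer when Buf is empty.
procedure RemoveLastSegment(var Buf: string);
var
  I: Integer;
begin
  I := Length(Buf);
  while (I > 0) and (Buf[I] <> '/') do
    Dec(I);
  if I > 0 then
    SetLength(Buf, I - 1)
  else
    Buf := '';
end;

// Sketch of RFC 3986, section 5.2.4. Rest is the remaining input,
// Result is the output buffer.
function RemoveDotSegments(Rest: string): string;
var
  P: Integer;
begin
  Result := '';
  while Rest <> '' do
  begin
    if StartsWith('../', Rest) then
      Delete(Rest, 1, 3)
    else if StartsWith('./', Rest) then
      Delete(Rest, 1, 2)
    else if StartsWith('/./', Rest) then
      Delete(Rest, 1, 2)            // '/./x' becomes '/x'
    else if Rest = '/.' then
      Rest := '/'
    else if StartsWith('/../', Rest) then
    begin
      Delete(Rest, 1, 3);           // '/../x' becomes '/x'
      RemoveLastSegment(Result);
    end
    else if Rest = '/..' then
    begin
      Rest := '/';
      RemoveLastSegment(Result);
    end
    else if (Rest = '.') or (Rest = '..') then
      Rest := ''
    else
    begin
      // Move the first segment, including a leading '/', to the output.
      P := 2;
      while (P <= Length(Rest)) and (Rest[P] <> '/') do
        Inc(P);
      Result := Result + Copy(Rest, 1, P - 1);
      Delete(Rest, 1, P - 1);
    end;
  end;
end;

begin
  WriteLn(RemoveDotSegments('/a/b/c/./../../g'));  // /a/g
  WriteLn(RemoveDotSegments('/../x'));             // /x  (output buffer was empty)
  WriteLn(RemoveDotSegments('/..'));               // /   (output buffer was empty)
end.
```

The last two calls exercise the empty-buffer case: without the guard in RemoveLastSegment, the backward scan would index position 0 of an empty string.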
DIHtmlParser 8.2.0 – 16 Sep 2021
- Support Delphi 11 Alexandria Win32 and Win64.
- Update DIUtils.pas Unicode functions to Unicode 14.0.0.
DIHtmlParser 8.1.0 – 5 Jun 2020
- Support Delphi 10.4 Sydney Win32 and Win64.
DIHtmlParser 8.0.1 – 30 Oct 2019
Delphi compilers with support for the inline directive (starting with Delphi 2005) failed to compile DIHtmlParser *.bpl packages for the Demo and Commercial editions. They generated a “[dcc32 Fatal Error] DIUtils: F2051 Unit DIContainers was compiled with a different version of DIUtils.StrSameIW”. Regular *.exe applications compiled without problems. The DIHtmlParser Source Code also compiled to both *.bpl packages and *.exe applications with no problems.
DIHtmlParser 8.0.0 – 8 Oct 2019
Extend character support to the full range of Unicode Code Points from $000000 to $10FFFF.
Up to now, DIHtmlParser stored code points as WideChars. This limited Unicode support to the Basic Multilingual Plane (BMP) from $0000 to $FFFF. Code points from the Supplementary Planes were converted to the $FFFD replacement character. This worked well for a great number of languages, but less common scripts did not work, and neither did the increasingly popular emojis from the Symbols and Pictographs Unicode blocks.
DIHtmlParser 8.0.0 overcomes these limitations and now covers the complete Unicode range. Changes are almost entirely internal and maintain backwards compatibility as much as possible. Existing applications should compile with no or only minor changes. WideChar routines are marked as deprecated and hint at their new complementary UCP routines.
TDIHtmlParser.Data is still a WideChar buffer. However, its contents are now fully UTF-16 encoded. This means that it may contain code points > $FFFF which take up two WideChars (surrogate pairs). As a result, an index into the buffer is no longer guaranteed to address a complete code point. TDIHtmlParser.Data related methods, like TDIHtmlParser.DataAsStrTrimW, are adjusted accordingly.
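
The effect of surrogate pairs on indexing can be seen in a small standalone sketch. It assumes Delphi 2009 or later (for UnicodeString) and does not use any DIHtmlParser, DIUtils, or YuUtf routine; CountCodePoints is a hypothetical helper written only for this illustration.

```pascal
program CodePointCountDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// A high surrogate ($D800..$DBFF) immediately followed by a low
// surrogate ($DC00..$DFFF) is counted as one code point.
function CountCodePoints(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and
       (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)   // surrogate pair: two WideChars, one code point
    else
      Inc(I);     // BMP character (or an unpaired surrogate)
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  // 'A', then U+1F600 encoded as the surrogate pair $D83D $DE00, then 'B'.
  S := 'A' + #$D83D#$DE00 + 'B';
  WriteLn(Length(S));            // 4 WideChars
  WriteLn(CountCodePoints(S));   // 3 code points
end.
```

In this string, S[2] and S[3] are the two halves of one code point, so neither index on its own yields a complete character.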
UnicodeString utility routines are rewritten to handle full UTF-16, including surrogate pairs. Most of them are in DIUtils.pas. YuUtf.pas also contains new utility routines for UTF-16 testing, encoding, and decoding. Where possible, string handling routines now take NativeInt type parameters for the buffer length.
Other noteworthy changes:
- TDIHtmlParser.UCP complements TDIHtmlParser.Char.
- The WideChar property TDIHtmlParser.CustomTagStartChar has a new UCS4Char complement CustomTagStartUcp. The same holds for TDIHtmlWriterPlugin.CustomTagStartChar and CustomTagStartUcp.
- TDICustomTag.GetStartCode has a new UCS4Char overload. So do GetEmptyElementCode and GetEndCode.
- Changed the type of TDIHtmlParser.StartCol, EndCol, StartLine, EndLine, StartPos, and EndPos from unsigned Cardinal to signed NativeInt.
- Removed conditional compilation directives DI_No_Classes and DI_No_Unicode_Component (source code only). TDIHtmlParser and TDIHtmlParserPlugin now always descend from TComponent and the Classes unit is always used.
- Improve DIUtils.pas Unicode processing to support Unicode Code Points from $000000 to $10FFFF. Adjust remaining source code accordingly.
- Update DIUtils.pas Unicode functions to Unicode 12.1.0.
- Delphi 4 and Delphi 5 crash when compiling DIUtils.pas. There is no error message, so it is not possible to work around the problem. Support for these compilers is therefore removed. At least Delphi 6 is now required to compile DIHtmlParser.
- Remove DI.inc include file. Directly link in DICompilers.inc instead.
DIHtmlParser 7.12.0 – 7 Mar 2019
- Fix potential TDIUnicodeWriter memory leak if TDIUnicodeWriteMethods.Init allocates its own memory.
- TDIUnicodeWriter.Clear calls TDIUnicodeWriteMethods.Flush to reset encoder state.
- KOI8-U converter now maps 0xB4 to U+0404 instead of U+0403.
- Update DIUtils.pas Unicode functions to Unicode 12.
- Compatibility update with DIConverters 1.18.0. These changes only affect projects using DIConverters:
  - Add ISO-2022-JP-MS encoding: Read_iso_2022_jp_ms read methods and Write_iso_2022_jp_ms write methods. This is recognized by TDIHtmlCharSetPlugin.
  - DIConverters converter functions now use the native unsigned integer type for the length of a string and support strings longer than 2 GB.
  - UTF-8 converter functions reject surrogates and out-of-range code points, namely those in the ranges 0xD800..0xDFFF and >= 0x110000.
  - Fix error handling in UCS-2, UCS-4, and UTF-32 decoder functions.
  - Tweak the GB18030 converter functions to map 0x8135F437 to U+E7C7.
  - Update the CP1255 converter functions to map 0xCA to U+05BA.
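
The code point ranges rejected by the UTF-8 converter functions can be expressed as a single range test. The sketch below is a standalone illustration only; IsUnicodeScalarValue is a hypothetical helper and not a DIConverters function.

```pascal
program ScalarValueDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Hypothetical helper: True if Cp may be encoded as UTF-8, that is, if it
// is neither a surrogate ($D800..$DFFF) nor out of range (>= $110000).
function IsUnicodeScalarValue(Cp: Cardinal): Boolean;
begin
  Result := (Cp < $D800) or ((Cp > $DFFF) and (Cp < $110000));
end;

begin
  WriteLn(IsUnicodeScalarValue($20AC));    // TRUE:  U+20AC, euro sign
  WriteLn(IsUnicodeScalarValue($D800));    // FALSE: first surrogate
  WriteLn(IsUnicodeScalarValue($DFFF));    // FALSE: last surrogate
  WriteLn(IsUnicodeScalarValue($10FFFF));  // TRUE:  highest code point
  WriteLn(IsUnicodeScalarValue($110000));  // FALSE: out of range
end.
```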
DIHtmlParser 7.11.0 – 24 Dec 2018
- Support Delphi 10.3 Rio Win32 and Win64.
DIHtmlParser 7.10.0 – 3 Apr 2017
- Support Delphi 10.2 Tokyo Win32 and Win64.
DIHtmlParser 7.9.0 – 7 May 2016
- Support Delphi 10.1 Berlin Win32 and Win64.
DIHtmlParser 7.8.0 – 5 Apr 2016
- New TDIHtmlWriterPlugin.PredefinedEntities:
  - peLtAttribValue to encode “<” as &lt; in attribute values. Required for XML conformance.
  - peGtAttribValue to encode “>” as &gt; in attribute values.
  - peQuotNum to encode quotation mark as numeric &#34; instead of &quot;.
- Fix: peAposNum was not applied to attribute values.
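
As a rough illustration of what such attribute value encoding produces, here is a standalone sketch. It does not call TDIHtmlWriterPlugin; EscapeAttribValue is a hypothetical function that hard-codes the behavior one would expect with "<", ">", and a numeric quotation mark all encoded, plus the usual "&" escaping.

```pascal
program EscapeAttribDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Hypothetical helper: escapes characters that are unsafe inside a
// double-quoted attribute value. The quotation mark is written as a
// numeric character reference.
function EscapeAttribValue(const S: string): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    case S[I] of
      '&': Result := Result + '&amp;';
      '<': Result := Result + '&lt;';
      '>': Result := Result + '&gt;';
      '"': Result := Result + '&#34;';
    else
      Result := Result + S[I];
    end;
end;

begin
  // Prints: <a title="1 &lt; 2 &amp; &#34;x&#34; &gt; y">
  WriteLn('<a title="' + EscapeAttribValue('1 < 2 & "x" > y') + '">');
end.
```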
DIHtmlParser 7.7.0 – 3 Mar 2016
- New TDIHtmlWriterPlugin properties to force the character used to quote attribute values:
  - QuoteHtmlTagsChar
  - QuoteCustomTagsChar
  - QuoteSsiTagsChar
DIHtmlParser 7.6.2 – 15 Sep 2015
- Support Delphi 10 Seattle Win32 and Win64.
DIHtmlParser 7.6.1 – 25 Apr 2015
- Add support for Delphi XE8 Win32 and Win64.
DIHtmlParser 7.6.0 – 3 Oct 2014
- Support Delphi XE7 Win32 and Win64.
- Mark unit DIUri as deprecated.
- TDIHtmlChangeLinksPlugin uses unit DIUri_3986 instead of the deprecated unit DIUri.
- Improved documentation shows inherited class members.
DIHtmlParser 7.5.0 – 28 Apr 2014
- Support Delphi XE6 Win32 and Win64.
- Minor improvements to demo projects.
DIHtmlParser 7.0.1 – 17 Feb 2014
- Compatibility update with other Yunqa products.
DIHtmlParser 7.0.0 – 25 Sep 2013
- Support Delphi XE5 Win32 and Win64.
DIHtmlParser 6.6.0 – 14 Jun 2013
- Support Delphi XE4 Win32 and Win64.
DIHtmlParser 6.5.1 – 24 Jan 2013
- Compatibility update with other Yunqa products.
DIHtmlParser 6.5.0 – 4 Oct 2012
- Support Delphi XE3 Win32 and Win64.
- TDIHtmlCharSetPlugin: Fix that a second <meta http-equiv> tag which is not a content type does not reset the decoding to the default decoding.
- Fix the DIHtmlParser_CharSetConverter demo so that the new character encoding is always written to the document, even if auto-detection is disabled.
DIHtmlParser 6.3.0 – 22 Jun 2012
- HTML5 Updates:
  - Add new HTML5 tag and attribute names and IDs, for example TAG_SECTION, TAG_SECTION_ID and ATTRIB_PLACEHOLDER and ATTRIB_PLACEHOLDER_ID. The new HTML5 tags and attributes are automatically registered by calling RegisterHtmlTags and RegisterHtmlAttribs.
  - Add new HTML5 named character references, known as entities in HTML4. After calling RegisterHtmlDecodingEntities, DIHtmlParser now recognizes all 2231 references listed in the current HTML5 draft.
  - Parse named character references / entities according to HTML5. In particular, a terminating semicolon ';' is no longer required. For example, &amp is recognized as '&' just as &amp;, &#38;, and &#x26;.
  - Named character references / entities can now be registered with and without terminating semicolon ';'. Change: If a terminating semicolon ';' is present, RegisterDecodingEntity now demands that it must be present in the entity name.
  - TDIHtmlCharSetPlugin recognizes the new HTML5 <meta charset="name"> character encoding declaration.
- Add DIUri_3986.TDIUri.AssignPath and DIUri_3986.TDIUri.AssignHost methods, plus DIUri_3986.UritoFileName with DIUri_3986.TDIUri URI input and UnicodeString filename output.
DIHtmlParser 6.2.0 – 14 Apr 2012
- Fix: When parsing from TDIHtmlParser.SourceStream, the size of the internal source buffer was not correctly calculated. Depending on the decoding, this slowed down reading or even stopped it before the end of the stream was reached.
- Fix: When parsing JavaScript, a regular expression character class containing just a single forward slash was not properly terminated.
- New DIUri_3986.pas unit implements URI parsing and resolution according to RFC 3986.
- DIUri.UriToFileName removes 'localhost' from authority, if present. Despite this change, DIUri is now deprecated. Use DIUri_3986 instead.
- ColorFromHtml: Improve parsing of #color values, in particular different lengths. Parse non-conforming #color values as legacy color values.
- Add optional EmptyAttribValues parameter (default = false) to TDIHtmlTag.GetCode, TDIHtmlTag.GetStartCode, TDIHtmlTag.GetEmptyElementCode, TDICustomTag.GetCode, TDICustomTag.GetStartCode, TDICustomTag.GetEmptyElementCode, TDISsiTag.GetCode, TDISsiTag.GetStartCode, TDISsiTag.GetEmptyElementCode.
- Work around a compiler warning in TDIHtmlParser.FillSourceBuffer (source code edition only).
DIHtmlParser 6.1.1 – 8 Dec 2011
- Relax end-tag parsing for </script> and </style> so they accept attribute content like the other end-tags. This does not strictly conform to the HTML specifications but is sometimes found in real-world HTML.
- New EndLine, EndCol, and EndPos functions determine the end of the current HTML piece.
DIHtmlParser 6.1.0 – 9 Nov 2011
- Support Delphi XE2 Win64.
- Fix AV when sorting empty TDIVector or descendants like TDITag and TDIHtmlTag.
DIHtmlParser 6.0.0 – 15 Oct 2011
- Support Delphi XE2 Win32 (binary editions) and Win64 (source code edition only right now).
- Fix a JavaScript parsing endless loop if the script ended with a slashes comment and its </SCRIPT> end tag was missing.
DIHtmlParser 5.2.2 – 7 Jul 2011
- Improve handling of comments and CDATA for JavaScript contents between <script> and </script> tags.
DIHtmlParser 5.2.1 – 21 Feb 2011
- Parse <![CDATA[ beginning of ptCDataSection case-sensitively, as per specification.
- Parse <![CDATA[ … ]]> sections separately inside JavaScript comments. This fixes a problem with pages that use a commented CDATA section inside a script element but do not properly close this comment before the closing </script> end tag. Such end tags are now recognized by DIHtmlParser.
- ExtractText demo works better with Delphi Unicode versions.
- Library source code compiles with FreePascal (Win32).
DIHtmlParser 5.2.0 – 28 Sep 2010
- Delphi XE support.
- Fix DIHtmlParser_ColoredCode demo for Unicode Delphis.
DIHtmlParser 5.1.2 – 24 Apr 2010
- New TDICustomHtmlWriterPlugin intermediate interface for greater flexibility in customizing TDIHtmlWriterPlugin.
- New TDIHtmlParser.DataAsStrTrim8 convenience method.
- Change case of HTML tag constants to lower case. This achieves slightly better results for HTML compression.
- Bring DIHtmlParser_BookmarkParser demo up to date with latest Mozilla and Chrome bookmark files.
- Improved documentation layout.
DIHtmlParser 5.1.1 – 17 Dec 2009
- Additions and bug fixes to DIUtils.pas.
DIHtmlParser 5.1 – 14 Sep 2009
- Delphi 2010 support.
- Added the following TDIHtmlParser parsing options:
  - TDIHtmlParser.EnableComments.
  - TDIHtmlParser.EnableEntities.
  - TDIHtmlParser.EnableExclamationMarkups.
- Allow custom tag attributes from a wider range of characters than for HTML tags.
- New DIHtmlParser_MailMerge demo.
DIHtmlParser 5.0.1 – 31 Jan 2009
- TDIHtmlParser: When parsing JavaScript, a forward slash “/” inside a regular expression character class was not recognized as such and could lead to an infinite loop.
- TDIHtmlCharSetPlugin: Correct decoding function for “GBK” encoding which did not read the 1 to 127 character range.
- Work around an unexpected Delphi 2009 automatic numeric AnsiChar Unicode conversion in DIUtils.pas which caused an error when compiled on a Windows OS set to a non-European (Asian, Cyrillic, etc.) codepage.
DIHtmlParser 5.0.0 – 24 Nov 2008
- Delphi 2009 support.
DIHtmlParser 4.5.0 – 30 Jul 2008
- TDIHtmlTag, TDICustomTag, TDISsiTag: .ConCatValue must not escape a '&' character in an attribute value immediately followed by a '{' character (HTML 4.0.1 Section B.7.1).
- Multiple fixes for filtering, most notably for TDITagFilters.SetStart.
- Better HTML title parsing according to how Firefox does it.
- TDIHtmlParser.TrimAttribValues behaved exactly opposite to what was intended.
- Modify DIHtmlParser_C6.bpk so that it should compile and install again with C++ Builder 6.
- CharSetConverter demo: Add BOM detection.
- ExtractText demo: Optional Unicode output controlled by compiler directive. Also add more tags to improve HTML → Text conversion.
- WebDownload demo: Improve generation of document names if URI has a query part.
- WriterPlugin demo: Support DIHtmlParser1.EnableHtmlTags.
- Some new, simple console demos inspired by support questions.
- Improve compatibility for parallel installation with other DI packages.
- Some code cleanup.
DIHtmlParser 4.4.1 – 15 May 2007
- Add some missing units to the DIHtmlParser *.dpk packages to suppress irritating hints during compilation.
DIHtmlParser 4.4.0 – 13 May 2007
- Delphi 2007 support.
- New HTML parser plugins:
  - TDIHtmlLinksPlugin2.
  - TDIHtmlCollectLinksPlugin.
  - TDIHtmlChangeLinksPlugin.
- Compatibility with DIConverters 1.11. If you are using DIHtmlParser with DIConverters and encounter incompatibility problems after upgrading to this new version, be sure to use the new version of DIConverters as well.
- Add XP Themes to Demo projects.
DIHtmlParser 4.3.1 – 20 Jun 2006
- Fixed a problem when parsing certain kinds of regular expression escapes in JavaScript.
- Reduced memory requirements for quickly skipping over JavaScript.
- Fixed filtering bugs in TDIHtmlParser.FindHtmlTag, TDIHtmlParser.FindSsiTag, and TDIHtmlParser.ParseNextHtmlTag.
DIHtmlParser 4.3 – 28 Dec 2005
- Added compatibility with Delphi 2006 Win32.
DIHtmlParser 4.2 – 14 Oct 2005
- New TDIHtmlParser.EnableHtmlTags property which controls if HTML tags are properly recognized as such or are simply treated as text. Ignoring HTML tags can be useful for HTML scripting.
- New TDIHtmlParser.TrimAttribValues property which controls if whitespace is automatically trimmed when parsing the attribute values of tags.
- Improved parsing of CustomTags and ASP.
- Fixed an error which could prematurely stop TDIUnicodeReader when a pushed source was popped at the end of a nested document.
- Added Delphi 3 compatibility to the utility units.
- Resolved dependency issues when DIHtmlParser is used in parallel with other DI products.
DIHtmlParser 4.1.1 – 2 Sep 2005
- Eliminated some compiler warnings regarding C++ Builder compatibility.
- Fixed a small packaging bug in the Demo edition which unfortunately slipped into the last update.
DIHtmlParser 4.1 – 31 Aug 2005
- Improved parsing of script contents:
  - Extended the internal JavaScript parser in order to improve the recognition of '/…/' regular expressions within JavaScript. Due to the nature of the JavaScript syntax, there is no 100% safe way to tell the difference between '/' as a divisor sign and '/' as the beginning of a regular expression, but the algorithm applied does a pretty good job and fixes a problem which occurred with certain HTML documents.
  - The new advanced JavaScript parsing is now the default, unless the script is identified as not being JavaScript.
  - The appropriate <META …> tag is read to determine the default scripting language. The current content script type is available via the TDIHtmlParser.ContentScriptType property.
  - New TDIHtmlParser.DefaultContentScriptType property to determine the content script type from outside the HTML document.
- Compatibility with other DI products.
DIHtmlParser 4.0 – 14 Apr 2005
- Added the option to link DIHtmlParser against DIConverters, which enables DIHtmlParser to read and write 130+ character encodings.
- Added native Pascal implementation for reading / decoding and writing / encoding the following character sets:
  - Mac Arabic, Mac Dingbats, Mac Central Europe, Mac Croatian, Mac Cyrillic, Mac Farsi, Mac Greek, Mac Hebrew, Mac Iceland, Mac Roman, Mac Romanian, Mac Thai, Mac Turkish
  - UCS-2 LE, UCS-2 BE
  - UCS-4 LE, UCS-4 BE
  - UTF-32 LE, UTF-32 BE
  - UTF-7 (Write_UTF_7 / Read_UTF_7)
  - UTF-7 Optional Direct Characters (Write_UTF_7_ODC / reads as Read_UTF_7)
  - JIS X0201, NextStep, TIS 620
- Improved the parser's handling of malicious markup frequently used in spam e-mail: The parser now treats invalid tags (like '<k$R>') as HTML Tags instead of Text. There is also a new piece type ptExclamationMarkup covering inserts starting with an exclamation mark like '<!A>'. It is returned for the character patterns '<! … >' which are not Comments, CData Sections, Document Templates, or SSI.
- Improved parsing of non-conformant XML Processing Instruction (XmlPI), marked as '<?XML Char* ?>'. By specification, XmlPI must terminate with '?>', but the '?' is sometimes missing. Specification-conformant parsing would then cause DIHtmlParser to unintentionally interpret lengthy stretches as XmlPI. This is now fixed by recognizing both variants as ending an XmlPI.
- Improved the recognition of HTML entities lacking a terminating semicolon character (like '&nbsp') in some cases.
- Added mapping of some illegal but commonly used HTML numeric entities into their appropriate Unicode value.
- Changed the TDIHtmlParser.StopParseAll procedure to a TDIHtmlParser.StopParse property. This must be set to True to stop the current parsing process. It applies to both TDIHtmlParser.ParseAll and TDIHtmlParser.ParseNextPiece, where it cancels an ongoing parsing process which did not yet return to the caller.
- Introduced TDIAbstractHtmlAttribsPlugin as ancestor class of TDIHtmlLinksPlugin, which now responds to a much wider range of link combinations, including multiple links contained within a single tag. Applications can also add custom Tag / Attribute combinations to report by calling TDIAbstractHtmlAttribsPlugin.AddAttrib. The TDIHtmlLinksPluginEvent callback definition has changed slightly and requires an interface change to existing applications.
- Added a TDIHtmlWriterPlugin.PredefinedEntities option which allows specifying some known predefined entities which will always be encoded by default when writing HTML text, regardless of other entity registrations.
- Shortened procedure name of TDITag.ForceAttribValue to TDITag.ForceAttrib.
- TDITag and descendant classes benefit from changes to DIContainers ancestors. This includes speed optimizations as well as some interface simplifications.