YuPcre2: Changes from DIRegEx
YuPcre2 is an up-to-date regular expression library for Delphi with Perl syntax. It directly supports UnicodeString, AnsiString, and UCS4String, as well as UTF-8 and UTF-16.
This document describes the differences and similarities between the new YuPcre2 and the old DIRegEx to help convert existing projects. If you have never used DIRegEx, or are starting a new project with YuPcre2, you can skip this document.
YuPcre2 is a new project, not just a drastic update to DIRegEx. A lot has changed, even though some units, classes, and functions carry familiar names. Unfortunately, it was not possible to keep identical identifiers because Delphi rejects them if both YuPcre2 and DIRegEx are installed into the IDE. Overall, `DIRegEx` names have changed to `DIRegEx2` where possible, which should simplify the transition to YuPcre2.
Unit Name Changes
Unit names had to be changed to allow YuPcre2 to be installed into the IDE in parallel with DIRegEx. Unit names start with the `YuPcre2` prefix. The native PCRE2 API is in `YuPcre2.pas`. `DIRegEx` units with class wrappers and helper routines have been renamed to `YuPcre2_RegEx2…`:
DIRegEx | YuPcre2 |
---|---|
DIRegEx_Api.pas | YuPcre2.pas |
n/a | YuPcre2OptInfo.pas |
DIRegEx_Reg.pas | YuPcre2Reg.pas |
DIRegEx.pas | YuPcre2_RegEx2.pas |
DIRegEx_Consts.pas | YuPcre2_RegEx2_Consts.pas |
DIRegEx_MaskControls.pas | YuPcre2_RegEx2_MaskControls.pas |
DIRegEx_SearchStream.pas | YuPcre2_RegEx2_SearchStream.pas |
DIRegEx_Utils.pas | YuPcre2_RegEx2_Utils.pas |
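
During a gradual migration it can help to keep both sets of unit names selectable in one place. The fragment below is a minimal sketch of that idea, using only unit names from the table above. `USE_YUPCRE2` is a hypothetical project-level conditional define, not something either library provides.

```delphi
uses
  {$IFDEF USE_YUPCRE2} // hypothetical define, set in the project options
  YuPcre2_RegEx2, YuPcre2_RegEx2_Utils   // new unit names
  {$ELSE}
  DIRegEx, DIRegEx_Utils                 // old unit names
  {$ENDIF}
  ;
```

Switching the unit names alone is unlikely to be enough for code to compile, because class names also gain the “2” suffix, as described in the next section.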
Class and Identifier Name Changes
Class names now contain “RegEx2”: the number 2 is appended to “RegEx”. Most members, helper routines, and identifier names are unchanged. Deprecation warnings are issued where appropriate.
`TDIRegEx2Base.CompileOptions` is empty by default. In DIRegEx, `coCaseLess` and `coDotAll` were set by default. YuPcre2 excludes them for compatibility with PCRE2. If matching relies on these options, set them like this:

    { Set YuPcre2 CompileOptions to DIRegEx default: }
    RegEx.CompileOptions := [coCaseLess, coDotAll];

`TDIRegEx2Base.BSR` and `TDIRegEx2Base.NewLine` options are new properties of their own. In DIRegEx they were part of `CompileOptions` and `MatchOptions`. As a consequence, `BSR` and `NewLine` options can no longer be passed to `CompileMatchPatternStrOpt` but must be set beforehand.
PCRE2 Native API Changes
- Names of the native API functions start with the “pcre2_” prefix. The “_8”, “_16”, and “_32” suffixes denote the width of the function's string code unit in bits.
- Many names have been changed; in particular, `pcre_exec` has become `pcre2_match`. The `PCRE_JAVASCRIPT_COMPAT` option has been split into independent functional options `PCRE2_ALT_BSUX`, `PCRE2_ALLOW_EMPTY_CLASS`, and `PCRE2_MATCH_UNSET_BACKREF`.
- Patterns, subject strings, and replacement strings may all contain binary zeros and for this reason are always passed as a pointer and a length. However, the length may be given as `PCRE2_ZERO_TERMINATED` for zero-terminated strings.
- The output vector that holds offsets of matched strings is now a vector of `PCRE2_SIZE` elements instead of Integers. The special value `PCRE2_UNSET` is used for unset elements.
- Error handling has been redesigned and error messages are available in all code unit widths. The error codes have been redesignated.
- Explicit “studying” of compiled patterns has been abolished; it now always happens automatically. JIT compiling is done by calling a new function, `pcre2_jit_compile`, after a successful return from `pcre2_compile`.
- The `capture_last` field of the `pcre2_callout_block` is now an unsigned integer, set to zero if there have been no captures.
- Saving / restoring a compiled pattern is accomplished by a set of serializing functions.
- There is a new function called `pcre2_substitute` that performs “find and replace” operations.
- Implement the `PCRE2_NO_DOTSTAR_ANCHOR`, `PCRE2_NEVER_BACKSLASH_C`, and `PCRE2_ALT_CIRCUMFLEX` options.
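
The change of the output vector from Integers to unsigned `PCRE2_SIZE` elements matters for code that detected unset groups with a signed comparison such as `< 0`, because that test can never succeed on an unsigned value. The program below is a minimal, self-contained sketch of the comparison against `PCRE2_UNSET` instead. It deliberately does not use the library: `TPcre2SizeStandIn` and `Pcre2UnsetStandIn` are hypothetical local stand-ins that mirror the C header definition (`PCRE2_UNSET` is `~(PCRE2_SIZE)0`), and the offset values are filled in by hand rather than produced by a real match. In real code, the library's own `PCRE2_SIZE` and `PCRE2_UNSET` declarations should be used.

```delphi
program OvectorUnsetDemo;
{$APPTYPE CONSOLE}

type
  // Stand-in for PCRE2_SIZE: assumed to be an unsigned, pointer-sized integer.
  TPcre2SizeStandIn = NativeUInt;

const
  // Stand-in for PCRE2_UNSET: all bits set, as in the C header.
  Pcre2UnsetStandIn = High(TPcre2SizeStandIn);

// The offset vector stores one (start, end) pair per group.
// A group is unset if its start offset equals the "unset" marker.
function GroupIsSet(const OVector: array of TPcre2SizeStandIn;
  Group: Integer): Boolean;
begin
  Result := (2 * Group + 1 <= High(OVector)) and
    (OVector[2 * Group] <> Pcre2UnsetStandIn);
end;

var
  OVector: array[0..5] of TPcre2SizeStandIn;
begin
  // Hand-filled example values, not the result of a real match:
  OVector[0] := 0;                 OVector[1] := 3;                 // whole match
  OVector[2] := Pcre2UnsetStandIn; OVector[3] := Pcre2UnsetStandIn; // group 1 unset
  OVector[4] := 1;                 OVector[5] := 3;                 // group 2

  Writeln('Group 1 set: ', GroupIsSet(OVector, 1)); // prints FALSE
  Writeln('Group 2 set: ', GroupIsSet(OVector, 2)); // prints TRUE
end.
```

Comparing for equality with the unset marker keeps the check correct regardless of the width of the size type.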
PCRE2 Functionality Changes
- Patterns may start with `(*NOTEMPTY)` or `(*NOTEMPTY_ATSTART)` to set the `PCRE2_NOTEMPTY` or `PCRE2_NOTEMPTY_ATSTART` options for every subject line that is matched by that pattern.
- For the benefit of those who use PCRE2 via some other application, that is, not writing the function calls themselves, it is possible to check the PCRE2 version by matching a pattern such as `(?(VERSION>=10)yes|no)` against a string such as “yesno”.
- There are case-equivalent Unicode characters whose encodings use different numbers of code units in UTF-8. U+023A and U+2C65 are one example. (It is theoretically possible for this to happen in UTF-16 too.) If a backreference to a group containing one of these characters was greedily repeated, and during the match a backtrack occurred, the subject might be backtracked by the wrong number of code units. For example, if `^(\x{23a})\1*(.)` is matched caselessly (and in UTF-8 mode) against `\x{23a}\x{2c65}\x{2c65}\x{2c65}`, group 2 should capture the final character, which is the three bytes E2, B1, and A5 in UTF-8. Incorrect backtracking meant that group 2 captured only the last two bytes. This bug has been fixed; the new code is slower, but it is used only when the strings matched by the repetition are not all the same length.
- Update Unicode to 8.0.0.
- A pattern such as `()a` was not setting the “first character must be 'a'” information. This applied to any pattern with a group that matched no characters, for example: `(?:(?=.)|(?<!x))a`.
- When an `(*ACCEPT)` is triggered inside capturing parentheses, it arranges for those parentheses to be closed with whatever has been captured so far. However, it was failing to mark any other groups between the highest capture so far and the current group as “unset”. Thus, the ovector for those groups contained whatever was previously there. An example is the pattern `(x)|((*ACCEPT))` when matched against “abcd”.
- Add the `(*NO_JIT)` pattern feature.
- Add callouts with string arguments.
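
The length mismatch behind the caseless backreference fix can be checked by hand. The program below is a small illustrative sketch that does not use the library at all: it applies the standard UTF-8 length rules to the two code points named above. `Utf8ByteCount` is a hypothetical helper written for this example.

```delphi
program Utf8LengthDemo;
{$APPTYPE CONSOLE}

// Number of bytes UTF-8 needs to encode a single Unicode code point.
function Utf8ByteCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint < $80 then
    Result := 1
  else if CodePoint < $800 then
    Result := 2
  else if CodePoint < $10000 then
    Result := 3
  else
    Result := 4;
end;

begin
  Writeln('U+023A: ', Utf8ByteCount($023A), ' bytes'); // prints U+023A: 2 bytes
  Writeln('U+2C65: ', Utf8ByteCount($2C65), ' bytes'); // prints U+2C65: 3 bytes
end.
```

Because the two case-equivalent characters occupy two and three bytes respectively, a backreference that matched the three-byte form cannot be backtracked by the length of the two-byte form captured in the group.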