DIRegEx is a library of components and procedures that implement regular expression pattern matching using the same syntax and semantics as Perl for Delphi (Embarcadero / CodeGear / Borland).
In mid 2015, DIRegEx has been superseded by the YuPcre2 Delphi Regular Expression Library. Since then, both DIRegEx and YuPcre2 have been developed in parallel. DIRegEx continued to receive security fixes, optimizations, and even improvements. Development of new features, however, took place in YuPcre2.
Beginning with DIRegEx 8.12.0 (25 Mar 2020), development focuses entirely on YuPcre2 and DIRegEx received just very few updates. Starting with this version, DIRegEx updates will be even less. This may even be the last release.
Users of DIRegEx are encouraged to deploy YuPcre2 for the newest Delphi regular expression technology. To ease migration, the YuPcre2 units and components are named as closely as possible to those of DIRegEx. A demo is available to start the conversion of existing projects, including testing.
Changes in this version:
(*MARK)
bug in the interpreter.(?C
as it is read, in order to avoid integer overflow.DI.inc
include file. Directly link in DICompilers.inc
instead.DI_No_RegEx_Component
compiler directive. TDIRegEx
always descend from TComponent
. Source code only.DI_No_Classes
compiler directive. The Classes
unit is always used. Source code only.TDIRegEx.Replace
and TDIRegEx16.Replace
did not return the start of the string if StartOffset > 0.TDIRegExSearchStream_Enc
to DIConverters 1.18.0: Converter functions now use the native unsigned integer type for the length of a string and support stings longer than 2 GB. This change only affects projects using DIConverters.[^\x{100}-\x{ffff}]*[\x80-\xff]
which has a repeated negative class with no characters less than 0x100 followed by a positive class with only characters less than 0x100, the first class was incorrectly being auto-possessified, causing incorrect match failures.(?(1)^())b
or (?(?=^))b
.\X
or \R
has a greater than 1 fixed quantifier.(?&xxx)*ABC(?<xxx>XYZ)
would (incorrectly) expect 'A' to be the first character of a match.pcre_dfa_exec
, in UTF mode when UCP support was not defined, there was the possibility of a false positive match when caselessly matching a “not this character” item such as [^\x{1234}]
(with a code point greater than 127) because the “other case” variable was not being initialized.pcre_jit_exec
checks whether the pattern is compiled in a given mode, it was also expected that at least one mode is available. This is fixed and pcre_jit_exec
returns with PCRE_ERROR_JIT_BADOPTION
when the pattern is not optimized by JIT at all.(?=(a))\1?b
, “b” was incorrectly set as the first character of a match..
against an empty string when the newline type is CRLF.(*UTF)\C[^\v]+\x80
against an 8-bit string containing multi-code-unit characters caused bad behaviour and possibly a crash.(?#
style comment is now ignored between a basic quantifier and a following '+' or '?' (example: X+(?#comment)?Y
.pcre_dfa_exec
misbehaved if it encountered a character class with a possessive repeat, for example [a-f]{3}+
.PCRE_UCP
set, a negative character type such as \D
in a positive class should cause all characters greater than 255 to match, whatever else is in the class. There was a bug that caused this not to happen if a Unicode property item was added to such a class, for example [\D\P{Nd}]
or [\W\pL]
.(?<RA>abc)(?(R)xyz)
was incorrectly compiled such that the conditional was interpreted as a reference to capturing group 1 instead of a test for recursion. Any group whose name began with R
was misinterpreted in this way. (The reference interpretation should only happen if the group's name is precisely R
.)PCRE_NO_START_OPTIMIZE
was not set:(?=.*X)X$
was incorrectly optimized as if it needed both an initial 'X' and a following 'X'..*
were incorrectly optimized as having to match at the start of the subject or after a newline. There are cases where this is not true, for example, (?=.*[A-Z])(?=.{8,16})(?!.*[\s])
matches after the start in lines that start with spaces. Starting .*
in an assertion is no longer taken as an indication of matching at the start (or after a newline).PCRE_AUTO_CALLOUT
was set on a pattern that had a (?#
comment between an item and its qualifier (for example, A(?#comment)?B)
pcre_compile
misbehaved.\E
was present between an item and its qualifier when PCRE_AUTO_CALLOUT
was set, pcre_compile
misbehaved.[^[:^ascii:]\d]
were not working correctly in UCP mode.\Q\E
sequence between an item and its qualifier caused pcre_compile
to misbehave when auto callouts were enabled.PCRE_EXTENDED
started with white space or a #
-type comment that was followed by (?-x)
, which turns off PCRE_EXTENDED
, and there was no subsequent (?x)
to turn it on again, pcre_compile
assumed that (?-x)
applied to the whole pattern and consequently mis-compiled it.pcre_copy_named_substring
for a named substring whose number was greater than the space in the ovector could cause a crash.pcre_get_substring_list
crashed if the use of \K
in a match caused the start of the match to be earlier than the end.(*ACCEPT)
in the middle of a sufficiently deeply nested set of parentheses of sufficient size caused an overflow of the compiling workspace (which was diagnosed, but of course is not desirable).(?(?C)0
was not diagnosing an error (“assertion expected”) when (?(?C)
was not followed by an opening parenthesis.[\W\p{Any}]
where both a negative-type escape (“not a word character”) and a property escape were present, the property escape was being ignored.(*MARK)
or (*THEN)
names.[[:punct:]b]
that is, a POSIX character class followed by a single ASCII character in a class item, was incorrectly compiled in UCP mode. The POSIX class got lost, but only if the single character followed it.[:punct:]
in UCP mode was matching some characters in the range 128-255 that should not have been matched.[:^ascii:]
or [:^xdigit:]
or [:^cntrl:]
are present in a non-negated class, all characters with code points greater than 255 are in the class. When a Unicode property was also in the class (if PCRE_UCP
is set, escapes such as \w
are turned into Unicode properties), wide characters were not correctly handled, and could fail to match.(?1)()((((((\1++))\x85)+)|))
.(?J)(?'d'(?'d'\g{d}))
.(?|(\k'Pm')|(?'Pm'))
, caused a buffer overflow at compile time.(?J:(?|(?'R')(\k'R')|((?'R'))))
.(?123)
.\l
in EBCDIC environments was incorrect, leading to its being treated as a literal 'l' instead of causing an error.pcre_exec
was called with an ovector of size 1.^(?:(?(1)x|)+)+$()
.(?=di(?⇐(?1))|(?=(.))))
.(?(R))*+
, was incorrectly compiled.(?:|a|){100}x
are analysed.[:
and \\
were incorrectly compiled and could cause reading from uninitialized memory or an incorrect error diagnosis.[:
caused pcre_compile
to run for a very long time.(?R
was followed by -
or +
incorrect behaviour happened instead of a diagnostic.(?|
it is possible for an apparently non-recursive back reference to become recursive if a later named group with the relevant number is encountered. This could lead to a buffer overflow.(?(<digits>)
and (?(R<digits>)
.{0,1}
repeat byte code..((?2)(?R)\1)()
, pcre_compile
failed to compile correct code, leading to undefined behaviour or an internally detected error.(?⇐\Ka)
) could make pcregrep loop.\X
was preceded by \C
in UTF mode (e.g. \C\X*
), and a subsequent item in the pattern caused a non-match, backtracking over the repeated \X
did not stop, but carried on past the start of the subject, causing reference to random memory and/or a segfault. There were also some other cases where backtracking after \C
could crash.((?2){73}(?2))((?1))
. A better mutual recursion detection method has been implemented./K
and /F
were both set with the option to save the compiled pattern.\O\C+
as a subject string.((?2){0,1999}())?
, which has a group containing a forward reference repeated a large (but limited) number of times within a repeated outer group that has a zero minimum quantifier, caused incorrect code to be compiled, leading to the error “internal error: previously-checked referenced subpattern not found” when an incorrect memory address was read.(?i)[A-`]
, where characters in the other case are adjacent to the end of the range, and the range contained characters with more than one other case, caused incorrect behaviour when compiled in UTF mode. In that example, the range a-j was left out of the class.(*FAIL)
. E.g: (?(?!))
.(?(?!)^)
caused references to random memory.(?!)
is optimized to (*FAIL)
. This was not handled correctly when this assertion was used as a condition, for example (?(?!)a|b)
. In pcre_exec
it worked by luck; in pcre_dfa_exec
it gave an incorrect error about an unsupported item.Z*(|d*){216}
, the auto-possessification code could take exponential time to complete. A recursion depth limit of 1000 has been imposed to limit the resources used by this optimization.(*UTF)[\S\V\H]
, which contains a negated special class such as \S
in non-UCP mode, explicit wide characters (> 255) can be ignored because \S
ensures they are all in the class. The code for doing this was interacting badly with the code for computing the amount of space needed to compile the pattern, leading to a buffer overflow.((?2)+)((?1))
which has mutual recursion nested inside other kinds of group caused stack overflow at compile time.(?1)(?#?'){8}(a)
which had a parenthesized comment between a subroutine call and its quantifier was incorrectly compiled, leading to buffer overflow or other errors.(?(?<E>.*!.*)?)
was not being diagnosed as missing an assertion after (?(
. The code was failing to check the character after (?(?<
for the !
or =
that would indicate a lookbehind assertion.X((?2)()*+){2}+
which has a possessive quantifier with a fixed maximum following a group that contains a subroutine reference was incorrectly compiled and could trigger buffer overflow.(?⇐((?2))((?1)))
caused a stack overflow instead of the diagnosis of a non-fixed length lookbehind assertion.[^\xff]((?1))
.^(?:(a)|b)(?(1)A|B)
, is another kind of back reference, but it was not setting the highest backreference number. This mattered only if pcre_exec
was called with an ovector that was too small to hold the capture, and there was no other kind of back reference (a situation which is probably quite rare). The effect of the bug was that the condition was always treated as FALSE when the capture could not be consulted, leading to a incorrect behaviour by pcre_exec
. This bug has been fixed.PCRE_DUPNAMES
was not set caused the amount of memory needed for the pattern to be incorrectly calculated, leading to overwriting.(\2)(\1)
caused a segfault at study time (while trying to find the minimum matching length). The infinite loop is now broken (with the minimum length unset, that is, zero).(?(?=0)?)+
is an example that caused this. Perl allows assertions to be quantified, but not if they are being used as conditions, so the above pattern is faulted by Perl. PCRE has now been changed so that it also rejects such patterns.(a)*+
with a minimum repeat of zero failed to allow the zero-repeat case if pcre_exec
was called with an ovector too small to capture the group.(*ACCEPT)
within a capturing group. When an (*ACCEPT)
is triggered inside capturing parentheses, it arranges for those parentheses to be closed with whatever has been captured so far. However, it was failing to mark any other groups between the hightest capture so far and the currrent group as “unset”. Thus, the ovector for those groups contained whatever was previously there. An example is the pattern (x)|((*ACCEPT))
when matched against “abcd”.((?(R)a|(?1)))+
, which contains a recursion within a group that is quantified with an indefinite repeat, caused a compile-time loop which used up all the system stack and provoked a segmentation fault which could cause the application to crash.Improvements
TDIRegEx.FormatSubStrChar
property.TDIRegEx.FormatOptions
and TDIRegEx16.FormatOptions
properties.Occurrence
argument for TDIRegEx.Replace2
and TDIRegEx16.Replace2
.Bug Fixes
(((a\2)|(a*)\g←1>))*
and other examples caused segmentation faults because of stack overflows at compile time.PCRE_ERROR_MATCHLIMIT
. This issue should be resolved now.\w+
in the pattern ^\w+(?>\s*)(?⇐\w)
which caused it not to match “test test”.\o{}
(as Perl does) and for \x{}
(which Perl does not).(?P<Name>a)?(?P<Name2>b)?(?(<Name>)c|d)*l
the reference to 'Name' was incorrectly treated as a reference to a duplicate name.^s?c
where the optional character has more than one “other case” was incorrectly compiled such that it would only try to match starting at “c”.\s
was studied, VT was not included in the list of possible starting characters.[\Qx]…
where x is any character, the class was incorrectly terminated at the ]
.\K
in the pattern.pcre_stack_guard
that can be set to point to an external function to check stack availability. It is called at the start of processing every parenthesized group.s
and \x{17f}
are both alternative cases for S
: the class [RST]
was handled correctly, but [R-T]
was not.a{1,3}b
did not match “ab”.DI_No_RegEx_Range_Checking
compiler directive.WildCardToPcre
, WildCardToPcreA
, and WildCardToPcreW
replace the overloaded WildCardToPcre
. Delphi 5 could not figure out the correct overload.coNoAutoPossess
compile option added.TDIRegEx.InfoMatchEmpty
/ TDIRegEx16.InfoMatchEmpty
which yields 1 if the pattern can match an empty string.pcre_jit_free_unused_memory
, pcre16_jit_free_unused_memory
, and pcre32_jit_free_unused_memory
forcibly free unused JIT executable memory.A{4}+
ignored the possessivenes of the quantifier (because it's meaningless), this was not happening when coCaseLess
was set. Not wrong, but inefficient.coNeverUtf
to lock out the use of UTF, in particular, blocking (*UTF)
etc.[[:<:]]
and [[:>:]]
as used in the BSD POSIX library to mean “start of word” and “end of word”, respectively, as a transition aid.\s
and are generally treated as white space, following this same change in Perl 5.18. There is now no difference between “Perl space” and “POSIX space”.\8
and \9
. If there is no previously encountered capturing group of those numbers, they are treated as the literal characters 8 and 9 instead of a binary zero followed by the literals. DIRegEx now does the same.\o{}
to specify codepoints in octal, making it possible to specify values greater than 0777 and also making them unambiguous.\s
was not matching two of the characters that Perl matches, namely NEL (U+0085) and MONGOLIAN VOWEL SEPARATOR (U+180E), though they were matched by \h
.[:graph:]
, [:print:]
, and [:punct:]
when coUCP
is set so as to include the same characters as Perl does in Unicode mode.\x{…
instead of treating the string as literal. DIRegEx now does the same.[A-\d]
or [a-[:digit:]]
now cause compile-time errors. Perl warns for these when in warning mode, but DIRegEx has no facility for giving warnings.+
that indicates a possessive quantifier. DIRegEx allowed a space before the quantifier, but not before the possessive +
. It now does.\s
and are generally treated as white space, following this same change in Perl 5.18. There is now no difference between “Perl space” and “POSIX space”. Whether VT is treated as white space in other locales depends on the locale.[^\x{100}]*
in UTF-8 mode). The effect was undefined: the group might or might not be deemed as matching an empty string, or the program might have crashed.\h
, \H
, \v
, \V
, and \R
must match a character.\X*
or \X{2,5})
when matched by pcre_exec() without using JIT:(?(?=ab)ab)
recorded “a” as a first data item, and therefore matched “ca” after “c” instead of at the start.coAutoCallout
and coUCP
were set for a pattern that contained character types such as \d
or \w
, too many callouts were inserted, and the data that they returned was rubbish.\K
(reset reported match start) within a repeated possessive group such as (a\Kb)*+
was not working.TDIRegEx.InfoSize
, TDIRegEx.InfoStudySize
, and TDIRegEx.InfoJitSize
output type from Cardinal
to C_size_t
, which is different in Win32 and Win64. If you are compiling for Win64, please adjust your code.TDICustomRegExSearch
did not find some matches on buffer boundaries.(*PRUNE)
, (*SKIP)
and (*THEN)
are now supported.\p{Xuc}
for matching characters that can be expressed in certain programming languages using Universal Character Names.(*LIMIT_MATCH)
and (*LIMIT_RECURSION)
.(*SKIP)
and (*PRUNE)
to within a recursive subpattern, this has now been done, just as with (*COMMIT)
.(*COMMIT)(*SKIP)
, COMMIT
would override SKIP
. Now, PCRE acts on whichever backtracking verb is reached first by backtracking. In some cases this makes it more Perl-compatible, but Perl's rather obscure rules do not always do the same thing.(*ACCEPT)
. Again, this sometimes improves Perl compatibility, and sometimes does not.capture_last
that is passed to callouts was incorrect in some cases when there was a capture on one path that was subsequently abandoned after a backtrack. Also, the capture_last value is now reset after a recursion, since all captures are also reset in this case.offsets[2]
to the “bumpalong” value, that is, the offset of the starting point of the matching process, provided the offsets vector is large enough.\A
escape now records a lookbehind value of 1, though its execution does not actually inspect the previous character. This is to ensure that, in partial multi-segment matching, at least one character from the old segment is retained when a new segment is processed. Otherwise, if there are no lookbehinds in the pattern, \A
might match incorrectly at the start of a new segment.PCRE_AUTO_CALLOUT
) and there was a conditional group that depended on an assertion, if the assertion was false, the callout that immediately followed the alternation in the condition was skipped when pcre_exec
was used for matching.(?⇐(*SKIP)ac)a
is matched against aa
.pcre_dfa_exec
is called with an output vector of length less than 2.New Features:
\X
so that it now matches a Unicode extended grapheme cluster.PCRE_STUDY_EXTRA_NEEDED
.(*UTF)
can now be used to start a pattern in any of the three libraries.Bug Fixes:
.*
appeared inside atomic brackets at the start of a pattern, or where there was a subsequent *PRUNE
or *SKIP
.pcre_exec
or pcre_dfa_exec
was called with a negative value for the subject string length, the error given was PCRE_ERROR_BADOFFSET
, which was confusing. There is now a new error PCRE_ERROR_BADLENGTH
for this case.\X
so that it now matches a Unicode extended grapheme cluster.PCRE_STUDY_EXTRA_NEEDED
.\s*\R
was auto-possessifying the \s*
when it should not, whereas \S*\R
was not doing so when it should.PCRE_UCP
was not set, \w+\x{c4}
was incorrectly auto-possessifying the \w+
when the character tables indicated that \x{c4}
was a word character. There were several related cases, all because the tests for doing a table lookup were testing for characters less than 127 instead of 255.MARK
name and give an error at compile time. The limit is 255 for the 8-bit library and 65535 for the 16-bit library.((?:a?)*)*c
or ((?>a?)*)*c
was matched against “aac”, it set group 1 to “aa” instead of to an empty string. The bug affected repeated groups that could potentially match an empty string.\uxxxx
in JavaScript mode are now subject to the same checks as \x{…}
characters in non-JavaScript mode. Specifically, codepoints that are too big for the mode are faulted, and in a UTF mode, disallowed codepoints are also faulted.TDIDfaRegEx
could cause incorrect processing when bytes with values greater than 127 were present. For TDIDfaRegEx16
, the bug would be provoked by values in the range 0xfc00 to 0xdc00. In both cases the values are those that cannot be the first data item in a UTF character. The bug showed with recursions, possessively repeated groups, and atomic groups.\h*
or \R*
might have been incorrectly matched..*
appeared inside atomic brackets at the start of a pattern, or where there was a subsequent *PRUNE
or *SKIP
, the start of string (or line, in multiline mode) was determined incorrectly.moNoStartOptimize
/ PCRE_NO_START_OPTIMIZE
in JIT as (*MARK)
support requires it.DIRegEx_Workbench_Form.pas
: Add missing array elements for coNoUtf16Check
and moNoUtf16Check
introduced in yesterday's release.TDIRegExBase.CompileOptions
and TDIRegExBase.MatchOptions
for component streaming instead of their CompileOptionBits
and MatchOptionBits
counterparts. Existing forms are updated automatically.TDIRegExBase.SetCompileOptions
and TDIRegExBase.SetMatchOptions
are now protected and virtual.TDIRegExSearchStream_Enc
and TDIRegExSearchStream_Utf8
compile and match option setters overloaded to include UTF-8 options.^(..)\1
did not partially match “aba” because checking references was done on an “all or nothing” basis. This also applied to repeated references.\R
did not give a hard partial match if CR was found at the end of the subject.\X
did not give a hard partial match after matching one or more characters at the end of the subject.a$
did not recognize a partial match for the string CR.(*MARK)
control verb is now supported by the JIT compiler.PCRE_INFO_MAXLOOKBEHIND
plus TDIRegEx.InfoMaxLookBehind
, TDIRegEx16.InfoMaxLookBehind
.(*COMMIT)
is now confined to within a recursive subpattern call.(*COMMIT)
is now confined to within a positive assertion.(*COMMIT)
control verb is now supported by the JIT compiler.TDIPerlRegEx.SubStrCount
, TDIPerlRegEx16.SubStrCount
, TDIDfaRegEx.SubStrCount
, TDIDfaRegEx16.SubStrCount
for partial match results.coNoUtf16Check
and match option moNoUtf16Check
.TDIPerlRegEx16
and TDIDfaRegEx16
. Both work on UnicodeStrings and WideStrings natively with not prior conversions. Full UTF-16 Unicode processing optional.(*SKIP)
was given with a name that did not match a (*MARK)
, and the match failed at the start of the subject, a reference to memory before the start of the subject could occur.(another)?(\1?)test
matched against “hello world test”.pcre_exec
and pcre16_exec
(the ovector size rounding is not applied in this particular case).pcre_info
. Use pcre_fullinfo
instead.poAutoJit
or by passing soJIT
to TDIRegEx.Study
.(?(?=c)c|d)++
was being incorrectly compiled and would have given unpredicatble results.(A){2,}+
behaved as if it was (A)(A)++
which meant that, after a subsequent mismatch, backtracking into the first (A)
could occur when it should not.\C
is now supported in lookbehinds and DFA matching.\N
without a following name in a []
class; DIRegEx now also gives an error.(a)(?2){2}(.)
was incorrectly expecting the subject to contain another “a” after the start.(*SKIP:name)
is activated without a corresponding (*MARK:name)
earlier in the match, the SKIP
should be ignored. This was not happening; instead the SKIP
was being treated as NOMATCH. For patterns such as A(*MARK:A)A+(*SKIP:B)Z|AAC
this meant that the AAC
branch was never tested.(*MARK)
, (*PRUNE)
, and (*THEN)
has been reworked and is now much more compatible with Perl, in particular in cases where the result is a non-match for a non-anchored pattern. For example, if b(*:m)f|a(*:n)w
is matched against “abc”, the non-match returns the name “m”, where previously it did not return a name. A side effect of this change is that for partial matches, the last encountered mark name is returned, as for non matches. The refactoring has had the pleasing side effect of it stack requirements.(*ACCEPT)
, (*COMMIT)
, (*FAIL)
, (*MARK)
, (*PRUNE)
, (*SKIP)
, (*THEN)
, \h
, \H
, \v
, \V
, and single character negative classes with fixed repetitions, e.g. [^a]{3}
, with and without coCaseLess
.\x
, \U
and \u
in JavaScript compatibility mode based on the ECMA-262 standard.(?⇐a{2}b)
that contained a fixed repetition were erroneously being rejected as “not fixed length” if coCaseLess
was set.TDIRegEx
classes (TDIPerlregEx
, TDIDFARegEx
) are not affected, but applications using the low level PCRE API might need small adjustments.(*MARK)
settings inside atomic groups that do not contain any capturing parentheses, for example, (?>a(*:m))
, were not being passed out. This bug was introduced in DIRegEx 6.0.0.(a)b|ac
is matched against “ac”, there is no captured substring, but while checking the failing first alternative, substring 1 is temporarily captured. If the output vector supplied to pcre_exec
was not big enough for this capture, the yield of the function was still zero (“insufficient space for captured substrings”). This cannot be totally fixed without adding another stack variable, which seems a lot of expense for a edge case. However, I the situation is now improved in cases such as (a)(b)x|abc
matched against “abc”, where the return code indicates that fewer than the maximum number of slots in the ovector have been set.pcre_exec
uses temporary memory during matching, and copies in the captures as far as possible afterwards. It was using the entire output vector, but this conflicts with the specification that only 2/3 is used for passing back captured substrings. Now it uses only the first 2/3, for compatibility. This is, of course, another edge case.pcre_dfa_exec
run exactly filled the ovector, the return from the function was zero, implying that there were other matches that did not fit. The correct “exactly full” value is now returned.(*PRUNE)
or any other control that caused it to give a non-standard return, invalid errors such as PCRE_ERROR_RECURSELOOP
or even infinite loops could occur.a(*SKIP)c|b(*ACCEPT)|
was studied, it stopped computing the minimum length on reaching *ACCEPT
, and so ended up with the wrong value of 1 rather than 0. Further investigation indicates that computing a minimum subject length in the presence of *ACCEPT
is difficult (think back references, subroutine calls), and so the code was changed so that no minimum is registered for a pattern that contains *ACCEPT
.(*ACCEPT)a
was miscompiled, thinking that the first byte in a match must be “a”.(*THEN)
appeared in a group that was called recursively or as a subroutine, it did not work as intended.A (B(*THEN)C) | D
where A, B, C, and D are complex pattern fragments (but not containing any |
characters). If A and B are matched, but there is a failure in C so that it backtracks to (*THEN)
, PCRE was behaving differently to Perl. PCRE backtracked into A, but Perl goes to D. In other words, Perl considers parentheses that do not contain any |
characters to be part of a surrounding alternative, whereas PCRE was treading (B(*THEN)C)
the same as (B(*THEN)C|(*FAIL))
– which Perl handles differently. PCRE now behaves in the same way as Perl, except in the case of subroutine/recursion calls such as (?1)
which have in any case always been different (but PCRE had them first).|
in a conditional group as creating alternatives. Such a group is treated in the same way as an ordinary group without any |
characters when processing (*THEN)
. PCRE has been changed to match Perl's behaviour.TDIRegEx.SubStrMatched
.pcre_info
as deprecated. Use pcre_fullinfo
instead.pcre_exec
and pcre_dfa_exec
when the UTF-8 string check fails, as long as the output vector has at least 2 elements. The offset of the start of the failing character and a reason code are placed in the vector.pcre_compile
, the offset that is now returned is for the first byte of the failing character, instead of the last byte inspected. This is an incompatible change, but it should be small enough not to be a problem. It makes the returned offset consistent with pcre_exec
and pcre_dfa_exec
.\R
was used with a maximizing quantifier it failed to skip backwards over a #13#10
pair if the subsequent match failed. Instead, it just skipped back over a single character (#10
). This seems wrong (because it treated the two characters as a single entity when going forwards), conflicts with the documentation that \R
is equivalent to (?>\r\n|\n|…etc)
, and makes the behaviour of \R*
different to (\R)*
, which also seems wrong. The behaviour has been changed.(abc)++
when using pcre_exec
.(?⇐(a)+)
was not diagnosed as invalid (non-fixed-length lookbehind).(a|)*(?1)
gave a compile-time internal error.((a|)+)+
did not notice that the outer group could match an empty string.(^a|^)+
was not marked as anchored.(.*a|.*)+
was not marked as matching at start or after a newline.(*ACCEPT)
was used in a subpattern that was called recursively, the restoration of the capturing data to the outer values was not happening correctly.(*ACCEPT)
and matched an empty string, and PCRE_NOTEMPTY
was set, pcre_exec
thought the whole pattern had matched an empty string, and so incorrectly returned a no match.(*THEN)
occurring in these branches, tail recursion is no longer possible because the return has to be checked for (*THEN)
. These two optimizations have therefore been removed.\R
was studied, it was assumed that \R
always matched two bytes, thus causing the minimum subject length to be incorrectly computed because \R
can also match just one byte.(*ACCEPT)
was studied, the minimum subject length was incorrectly computed.(*ACCEPT)
was used in an assertion that matched an empty string and PCRE_NOTEMPTY
was set, PCRE applied the non-empty test to the assertion.(?>(a))b|(a)c
when matching “ac” set capturing group 1 to “a”, when in fact it should be unset. This applied to multi-branched capturing and non- capturing groups, repeated or not, and also to positive assertions (capturing in negative assertions does not happen in PCRE) and also to nested atomic groups.pcre_exec
has been changed so that if they are repeated, backtracking one repetition now resets captured values correctly. For example, if ((?>(a+)b)+aabab)
is matched against “aaaabaaabaabab” the value of captured group 2 is now correctly recorded as “aaa”. Previously, it would have been “a”. As part of this code refactoring, the way recursive calls are handled has also been changed.(?(?=(a))a)
was matched against “a”, no capturing was returned.(xxx(?1)|yyy)
but not mutual recursions (where group 1 called group 2 while simultaneously a separate group 2 called group 1). A stack overflow occurred in this case. This is now fixed this by limiting the recursion depth to 10.\X
with an unlimited repeat could fail if at any point the first character it looked at was a mark character.\k
was not followed by a braced, angle-bracketed, or quoted name, PCRE compiled something random. Now it gives a compile-time error (as does Perl).*MARK
encountered during the processing of a positive assertion is now recorded and passed back (compatible with Perl).\g
was being checked for fancy things in a character class, when it should just be a literal “g”.[:a[:digit:]]
whereas Perl was not. It seems that the appearance of a nested POSIX class supersedes an apparent external class. For example, [:a[:digit:]b:]
matches “a”, “b”, “:”, or a digit. Also, unescaped square brackets may also appear as part of class names. For example, [:a[:abc]b:]
gives unknown class “[:abc]b:]”. PCRE now behaves more like Perl.\N
with a braced quantifier such as {1,}
(this was because it thought it was \N{name}
, which is not supported).((?1))((?2))
. There is now a runtime test that gives an error if a subgroup is called recursively as a subpattern for a second time at the same position in the subject string. In previous releases this might have been caught by the recursion limit, or it might have run out of stack.(?(R)a+|(?R)b)
is quite safe, as the recursion can happen only once. PCRE was, however incorrectly giving a compile time error “recursive call could loop indefinitely” because it cannot analyze the pattern in sufficient detail. The compile time test no longer happens when PCRE is compiling a conditional subpattern, but actual runaway loops are now caught at runtime.(*MARK:NAME)
and other backtracking verbs. PCRE has been changed to be the same..(*F)
, when applied to “abc” with PCRE_PARTIAL_HARD
, gave a partial match of an empty string instead of no match. This was specific to the use of “.”.f.*
, if compiled with PCRE_UTF8
and PCRE_DOTALL
and applied to “for” with PCRE_PARTIAL_HARD
, gave a complete match instead of a partial match. This bug was dependent on both the PCRE_UTF8
and PCRE_DOTALL
options being set.\babc|\bdef
pcre_study
was failing to set up the starting byte set, because \b
was not being ignored.PCRE_PARTIAL_HARD
affects the matching of $, \z, \Z, \b, and \B. If the match point is at the end of the string, previously a full match would be given. However, setting PCRE_PARTIAL_HARD
has an implication that the given string is incomplete (because a partial match is preferred over a full match). For this reason, these items now give a partial match in this situation. [Aside: previously, the one case /t\b/ matched against “cat” with PCRE_PARTIAL_HARD
set did return a partial match rather than a full match, which was wrong by the old rules, but is now correct.]PCRE_EXTENDED
is set, when PCRE_NEWLINE_ANY
and PCRE_UTF8
were also set. If a UTF-8 multi-byte character included the byte 0x85 (e.g. +U0445, whose UTF-8 encoding is 0xd1,0x85), this was misinterpreted as a newline when scanning for the end of the comment. (*Character* 0x85 is an “any” newline, but *byte* 0x85 is not, in UTF-8 mode). This bug was present in several places in pcre_compile
.pcre_compile
was skipping #-introduced comments when looking ahead for named forward references to subpatterns, the only newline sequence it recognized was NL. It now handles newlines according to the set newline convention.pcre_exec
nor pcre_dfa_exec
was checking that the value given as a starting offset was within the subject string. There is now a new error, PCRE_ERROR_BADOFFSET
, which is returned if the starting offset is negative or greater than the length of the string. In order to test this, pcretest is extended to allow the setting of negative starting offsets.PCRE_ERROR_SHORTUTF8
to make it possible to distinguish between a bad UTF-8 sequence and one that is incomplete.PCRE_NO_START_OPTIMIZE
option, which is now allowed at compile time – but just passed through to pcre_exec
or pcre_dfa_exec
. This makes it available to pcregrep and other applications that have no direct access to PCRE options. The new /Y option in pcretest sets this option when calling pcre_compile
.(*MARK:ARG)
and for ARG additions to PRUNE
, SKIP
, and THEN
.(*ACCEPT)
was not working when inside an atomic group.\R
and \X
were always treated as literals, whereas Perl faults them if its -w option is set. Changed so that they fault when coExtra
is set.\N
which always matches any character other than newline. (It is the same as “.” when coDotAll
is not set.)\s
etc use properties. The new properties are: Xan
(alphanumeric), Xsp
(Perl space), Xps
(POSIX space), and Xwd
(word).coUCP
to make \b
, \d
, \s
, \w
, and certain POSIX character classes use Unicode properties. (*UCP)
at the start of a pattern can be used to set this option.coUtf8
mode, if a pattern that was compiled with coCaseLess
was studied, and the match started with a letter with a code point greater than 127 whose first byte was different to the first byte of the other case of the letter, the other case of this starting letter was not recognized.TDIRegEx.Study
now recognizes \h
, \v
, and \R
when constructing a bit map of possible starting bytes for non-anchored patterns.\R
and a number of cases that involve Unicode properties are recognized, both explicit and implicit when coUCP
is set.(?&t)(?#()(?(DEFINE)(?<t>a))
which has a forward reference to a subpattern the other side of a comment that contains an opening parenthesis caused either an internal compiling error, or a reference to the wrong subpattern.\p{Lu}*
) was used with non-UTF-8 input, it could crash or give wrong results if characters with values greater than #$C0 were present in the subject string. (Detail: it assumed UTF-8 input when processing these items.)TDIRegEx.CompileFormatPattern
.DIRegEx_MaskControls
. For two AnsiStrings Str1
and Str2
, these old Delphi versions do not compile Str1 := AnsiString(Str2);
assignments equal to Str1 := Str2;
but instead cause memory disorders further down the stack. Delphi 7 and newer are not affected.OMMIT_PCRE_COMPILE
compiler directive for DIRegEx_Api.pas (source code only).(?&t)*+(?(DEFINE)(?<t>.))
which has a possessive quantifier applied to a forward-referencing subroutine call, could compile incorrect code or give the error “internal error: previously-checked referenced subpattern not found”.\K
giving a compile-time error if it appeared in a lookbehind assersion.\K
was not working if it appeared in an atomic group or in a group that was called as a “subroutine”, or in an assertion. Perl 5.11 documents that \K
is “not well defined” if used in an assertion. DIRegEx now accepts it if the assertion is positive, but not if it is negative.(?P<L1>(?P<L2>0)|(?P>L2)(?P>L1))
in which the only other item in branch that calls a recursion is a subroutine call – as in the second branch in the above example – was incorrectly given the compile-time error “recursive call could loop indefinitely” because pcre_compile
was not correctly checking the subroutine for matching a non-empty string.TDIRegEx
.MatchPattern
property back to AnsiString. This was unfortunately required type to fix a Delphi 2009 / 2010 RawByteString streaming problem.TDIRegEx
.MatchPatternRaw
: RawByteString property to allow Unicode Delphis to set the MatchPattern
without automatic codepage conversion. This is now the recommended MatchPattern runtime property.TDIRegExMaskEdit
and TDIRegExMaskComboBox
now automatically encode text to UTF-8 when their RegEx component is in UTF-8 mode.moNotEmptyAtStart
which makes it possible to have an empty string match not at the start, even when the pattern is anchored.FormatPattern
introduced in 4.6.1 Beta 1.TDIRegEx
.Format
and duplicate substring names.$(PRODUCT_NAME_VERSION) is mainly a bug-fix release:
TDIRegEx
.MatchNext
to match empty result strings. The new algorithm detects potential infinite loops and advances the search position as necessary.TDIRegEx
.SubStrPtr
.TDIRegEx
: InfoOkPartial
and InfoJChanged
.TDIRegEx
.CompiledRegExpArray
.coNewLineAnyCrLf
which is like coNewLineAny
, but matches only CR, LF, or CRLF as a newline sequence. The compile-option equivalent is moNewLineAnyCrLf
. Only a single newline option may be set at the same time. Invalid combinations of newline options will raise an exception.TDICustomRegExSearch
TDIRegExSearchStream
TDIRegExsEarchStream_Enc
TDIRegExSearchStream_ANSI
TDIRegExSearchStream_Binary
TDIRegExSearchStream_Binary16BE
TDIRegExSearchStream_Binary16LE
TDIRegExSearchStream_OEM
TDIRegExSearchStream_UTF16BE
TDIRegExSearchStream_UTF16LE
SubStrCount
returns the actual count of captured substrings, even for descendent classes. Fixed a problem where the wrong value was returned for TDIDfaRegEx
. Likewise improved the regular expression workbench.TDIRegExInspector
to handle Windows XP themes.List2
and Replace2
functions: They are different from the old List
and Replace
in that they return the number of matches listed / replaced and also work on empty matches. This can be usefull for replacing empty lines, for example.poUserLocale
Option if you are matching ANSI strings in the user's default locale.coNewLineAny
to recognize any of the Unicode newline sequences as “newline” when processing dot, circumflex, or dollar metacharacters, or #-comments in /x mode.