A new revision of UTS #18, Unicode Regular Expressions, is now available.
Regular expressions are a key tool in software development. Back
in 2000, few regular expression engines supported Unicode, even at
a basic level. UTS #18 set out to raise the bar, describing how
regular expression engines could be adapted to deal with Unicode
correctly and completely. Since that time, major programming
languages and libraries have adopted level 1 features (supporting
all Unicode literals, basic character properties, subtraction,
intersection, ...), and some have also adopted level 2 features
(full character properties, grapheme clusters, ...).
The main focus of this release is on handling the complement of
properties of strings. The text now distinguishes between code
point complement and full complement, explicitly defines the
complement operator [^...] to be code point complement, and gives
the reasons for that choice in an annex. It also outlines the
important difference between [A--B] and [A&&[^B]], setting out
why the latter is insufficient to represent set difference.
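
As a rough illustration of why the two expressions differ, the following Python sketch models a character class as a set of strings, where a single code point is a string of length 1. It does not use any regex engine's API. The tiny `CODE_POINTS` universe and the helper names are assumptions made only for this example, and the sample sets `A` and `B` are hypothetical.

```python
# Toy model, not a real regex engine: a character class is a set of strings,
# and a single code point is a string of length 1.
# Assumption: the universe of code points is shrunk to a few letters so the
# complement stays finite and easy to inspect.
CODE_POINTS = frozenset("abcdefgh")


def code_point_complement(b):
    """Model of [^B]: every code point not in B.

    The result contains only single code points, so multi-code-point
    strings can never be members of it.
    """
    return frozenset(cp for cp in CODE_POINTS if cp not in b)


def difference(a, b):
    """Model of [A--B]: members of A that are not members of B."""
    return frozenset(a) - frozenset(b)


def intersection(a, b):
    """Model of [A&&B]: members common to A and B."""
    return frozenset(a) & frozenset(b)


# Hypothetical sample sets: A contains the multi-code-point string "ch".
A = frozenset({"a", "b", "ch"})
B = frozenset({"a"})

true_difference = difference(A, B)
via_complement = intersection(A, code_point_complement(B))

print(sorted(true_difference))
print(sorted(via_complement))
```

In this model the string "ch" survives the set difference but is lost when the difference is written as an intersection with a code point complement, because "ch" is not a single code point and so cannot be in [^B].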
For the EBNF in general, and for character classes with strings in
particular, examples were added and the text clarified. A new
annex provides examples of how character classes can be parsed.