[Bug regex/23393] Handle [a-z] and [A-Z] in consistent portable fashion regardless of locale.

glaubitz at physik dot fu-berlin.de
https://sourceware.org/bugzilla/show_bug.cgi?id=23393

Eric Blake <eblake at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |eblake at redhat dot com

--- Comment #30 from Eric Blake <eblake at redhat dot com> ---
(In reply to Florian Weimer from comment #12)

> I find it very dubious that the current implementation of ranges is useful
> for anything at all, except for implementation convenience (as it's what we
> have today).
>
> Two possible improvements come to my mind:
>
> (a) If both range endpoints are ASCII, match only ASCII characters.
>
> (b) Ranges include all characters with the same primary collation weight as
> the endpoints.
>
> It's possible to implement both, with (a) superseding (b).  I'm not sure if
> today, range expressions can match collating elements consisting of multiple
> characters, in which case the following variant might be less surprising:
>
> (b') Ranges include all collating elements with the same primary weight as
> the endpoints.
>
> Both approaches are conforming to POSIX because ranges in other locales are
> undefined anyway.  As far as I can see, available user feedback suggests
> that (a) is the expected behavior.

Well, close to (a), at any rate.  You're looking for Rational Range
Interpretation, which has already been picked up by several GNU tools (awk,
coreutils, sed, bash, ...).
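
To make option (a) concrete, here is a minimal sketch in C of a range test
that compares by code point whenever both endpoints are ASCII.  It is not
taken from glibc or from any of the tools listed above; the names
cp_is_ascii and ascii_range_match are hypothetical, and the -1 return value
is just one assumed way to tell the caller to fall back to some other rule
(such as (b) or (b')) when an endpoint is not ASCII.

```c
#include <stdbool.h>

/* Hypothetical helper: true if cp is a 7-bit ASCII code point. */
static bool cp_is_ascii(unsigned int cp)
{
    return cp < 0x80;
}

/*
 * Hypothetical sketch of option (a).
 * Returns 1 if c lies in [lo-hi] by code point, 0 if it does not,
 * and -1 if either endpoint is non-ASCII, meaning this rule does not
 * apply and the caller has to use another interpretation.
 * Locale collation order is never consulted here.
 */
static int ascii_range_match(unsigned int lo, unsigned int hi, unsigned int c)
{
    if (!cp_is_ascii(lo) || !cp_is_ascii(hi))
        return -1;
    return (c >= lo && c <= hi) ? 1 : 0;
}
```

With this rule, ascii_range_match('a', 'z', 'B') is 0 in every locale,
whereas a collation-based range may or may not include 'B' depending on
how the locale orders upper- and lowercase letters.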

>
> I think some tools actually implement (a) already because we went through
> this fifteen years ago or something like that, but I can't find the historic
> discussion.

GNU awk has a whole section on this:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

http://austingroupbugs.net/view.php?id=1078 and nearby discussion on the Austin
Group list may also be relevant to the discussion (namely, that if [[:digit:]]
corresponds to the locale and ctype definition of isdigit(), it MUST be exactly
10 characters in ALL locales, and must not add any non-ASCII Unicode digits,
because doing that violates the principle of least surprise).
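
As a small illustration of the "exactly 10 characters" point, the sketch
below counts how many single-byte values isdigit() accepts.  The function
name count_single_byte_digits is hypothetical.  The sketch only looks at
the single-byte isdigit() interface and says nothing about how a regex
engine actually implements [[:digit:]].

```c
#include <ctype.h>
#include <limits.h>

/*
 * Hypothetical helper: count the values in 0..UCHAR_MAX that
 * isdigit() classifies as decimal digits.  Standard C defines
 * isdigit() as testing for the ten decimal digits '0'..'9'.
 */
static int count_single_byte_digits(void)
{
    int count = 0;
    for (int c = 0; c <= UCHAR_MAX; c++) {
        if (isdigit(c))
            count++;
    }
    return count;
}
```

If [[:digit:]] is tied to isdigit(), a conforming result here is 10, which
is the invariant the Austin Group discussion argues must hold in all locales.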
