[Bug localedata/14010] New: Serious omissions in alphabetic character class

cvs-commit at gcc dot gnu.org
http://sourceware.org/bugzilla/show_bug.cgi?id=14010

             Bug #: 14010
           Summary: Serious omissions in alphabetic character class
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: [hidden email]
        ReportedBy: [hidden email]
                CC: [hidden email]
    Classification: Unclassified


The localedata generation code defines is_alpha based on Unicode categories L*,
plus Nl, Nd, and a moderate number of special cases mostly to fix Thai language
support (to fix is_alpha returning false for letters in category Mn). However,
Thai is not the only language affected; any language that uses non-spacing
letters is broken by glibc's deficient is_alpha definition. As a particular
example, all of the Tibetan subjoined letters are considered non-alphabetic
(and thus punctuation) by glibc.

Unicode addresses this issue by defining the Other_Alphabetic property in
PropList.txt and the Alphabetic derived property in DerivedCoreProperties.txt,
the latter of which consists of Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic. This
subsumes all special-case hacks for Thai in glibc's gen-unicode-ctype.c and
fixes the issue (at least approximately) for all other languages/scripts at the
same time.

glibc's localedata should adopt the definition of Alphabetic from Unicode's
DerivedCoreProperties.txt (and still add Nd and the special cases from So).
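As a rough illustration of the proposed rule (not the actual code in
gen-unicode-ctype.c), the classification could be sketched as below. The
function names and the boolean inputs are hypothetical: the sketch assumes the
caller has already looked up the General_Category string, whether the code
point is listed under Other_Alphabetic in PropList.txt, and whether it is one
of the existing So special cases.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: Unicode's derived Alphabetic property as described
   above, i.e. Lu+Ll+Lt+Lm+Lo+Nl plus Other_Alphabetic.
   gc is the two-letter General_Category abbreviation, e.g. "Lo" or "Mn". */
static bool is_unicode_alphabetic(const char *gc, bool other_alphabetic)
{
    static const char *const cats[] = { "Lu", "Ll", "Lt", "Lm", "Lo", "Nl" };

    for (size_t i = 0; i < sizeof cats / sizeof cats[0]; i++)
        if (strcmp(gc, cats[i]) == 0)
            return true;

    /* Non-spacing letters (category Mn) only qualify through this flag. */
    return other_alphabetic;
}

/* Hypothetical helper: the definition proposed in this report, i.e.
   Alphabetic, plus Nd, plus the special cases taken from So.
   so_special_case is an assumed input; the list itself is not shown here. */
static bool is_alpha_proposed(const char *gc, bool other_alphabetic,
                              bool so_special_case)
{
    return is_unicode_alphabetic(gc, other_alphabetic)
           || strcmp(gc, "Nd") == 0
           || so_special_case;
}
```

With these definitions, a category Mn character that is listed in
Other_Alphabetic is treated as alphabetic without any per-script special case,
while an Mn character not listed there is still rejected.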

--
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

[Bug localedata/14010] Serious omissions in alphabetic character class

cvs-commit at gcc dot gnu.org
http://sourceware.org/bugzilla/show_bug.cgi?id=14010

--- Comment #1 from Rich Felker <bugdal at aerifal dot cx> 2012-09-21 23:07:28 UTC ---

Ping. Has anybody looked at this?

[Bug localedata/14010] Serious omissions in alphabetic character class

cvs-commit at gcc dot gnu.org
http://sourceware.org/bugzilla/show_bug.cgi?id=14010

--- Comment #2 from joseph at codesourcery dot com <joseph at codesourcery dot com> 2012-09-23 19:34:48 UTC ---

We know that there are over 500 open bugs and bugs are filed faster than
they are fixed.  Constructive responses on libc-alpha to
<http://sourceware.org/ml/libc-alpha/2012-08/msg00611.html> regarding how
to get more people actively fixing more bugs would be more useful, towards
the goal of getting down to maybe 100 bugs that are genuinely hard, than
pinging individual bugs (unless the ping is for something like reminding
someone to submit a patch or test whether a commit has fixed the bug for
them - where there is clear in-progress work that may have been forgotten
about).

There's plenty of room for an interested person to become glibc's
character set expert and address this bug, bug 14094 and bug 14095 (only
14095 is particularly likely to be hard) and probably other bugs as well.
