[Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D

glaubitz at physik dot fu-berlin.de
http://sourceware.org/bugzilla/show_bug.cgi?id=13063

           Summary: Can not 'sort -u' all Chinese characters in CJK
                    UNIFIED IDEOGRAPH EXTENSION A/B/C/D
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: critical
          Priority: P2
         Component: localedata
        AssignedTo: [hidden email]
        ReportedBy: [hidden email]


Hi,

Refer to glibc/localedata/locales/zh_CN and iso14651_t1_pinyin or
iso14651_t1, glibc just support unicode3.0.

The new version of unicode is 6.0, it extend CJK UNIFIED IDEOGRAPH with
extension A/B/C/D, and extension A is included in GB18030:2005( China
locale charset standard).

So at least, glibc should sort all Chinese characters in CJK UNIFIED IDEOGRAPH
and EXTENSIONA(U+3400-U+4DBF).

The real effect is sort -u.
If you execute sort -u examples_CJK_extensionA.txt (see attachment), you
will got only one Chinese character "㑗".


Regards,
An Yang

--
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug localedata/13063] Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D

glaubitz at physik dot fu-berlin.de
http://sourceware.org/bugzilla/show_bug.cgi?id=13063

--- Comment #1 from An Yang <an.euroford at gmail dot com> 2011-08-06 17:24:33 UTC ---
Created attachment 5880
  --> http://sourceware.org/bugzilla/attachment.cgi?id=5880
example characters in CJK extension A.

--
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug localedata/13063] 'sort -u' will erase some Chinese characters

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
http://sourceware.org/bugzilla/show_bug.cgi?id=13063

An Yang <an.euroford at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Can not 'sort -u' all       |'sort -u' will erase some
                   |Chinese characters in CJK   |Chinese characters
                   |UNIFIED IDEOGRAPH EXTENSION |
                   |A/B/C/D                     |

--
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug localedata/13063] 'sort -u' will erase some Chinese characters

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
http://sourceware.org/bugzilla/show_bug.cgi?id=13063

--- Comment #2 from An Yang <an.euroford at gmail dot com> 2011-08-07 17:42:44 UTC ---
I'm not sure, this bugs has any relationship with charmaps, maybe or may not.
But the value of LC_COLLATE in zh_CN is:

% ISO 14651 collation sequence
LC_COLLATE
copy "iso14651_t1_pinyin"
END LC_COLLATE

I'm sure, something is wrong in this table.

All the erased Chinese characters do not a record in iso14651_t1_pinyin, but
they are included in CJK unified Ideographs/ExtA/B/C/D.

--
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
Reply | Threaded
Open this post in threaded view
|

[Bug localedata/13063] 'sort -u' will erase some Chinese characters

glaubitz at physik dot fu-berlin.de
In reply to this post by glaubitz at physik dot fu-berlin.de
http://sourceware.org/bugzilla/show_bug.cgi?id=13063

--- Comment #3 from An Yang <an.euroford at gmail dot com> 2011-08-08 16:54:28 UTC ---
There are 25496 Chinese characters in iso14651_t1_pinyin, most of them
distribute over CJK unified ideographs and CJK unified ideographs extension A.

But there are 27552 Chinese characters in CJK unified ideographs and extension
A, more than 2000 Chinese characters without pinyin were losted.

So my suggestion is just add the losted characters at the end of the
iso14651_t1_pinyin, in the order of unicode.

Could you give me any feedback?

--
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.