[Bug localedata/4024] New: collation in pinyin for zh_CN locale


[Bug localedata/4024] New: collation in pinyin for zh_CN locale

glaubitz at physik dot fu-berlin.de
Currently, zh_CN is just copying the iso14651 collation, which is incorrect, as
the most frequently used collation should be pinyin.

The modified collation follows.

--
           Summary: collation in pinyin for zh_CN locale
           Product: glibc
           Version: 2.4
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: localedata
        AssignedTo: libc-locales at sources dot redhat dot com
        ReportedBy: fundawang at gmail dot com
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=4024

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug localedata/4024] collation in pinyin for zh_CN locale

glaubitz at physik dot fu-berlin.de

------- Additional Comments From fundawang at gmail dot com  2007-02-11 11:27 -------
Created an attachment (id=1547)
 --> (http://sourceware.org/bugzilla/attachment.cgi?id=1547&action=view)
zh_CN@pinyin.gb18030 collate

The attachment is the collation for the zh_CN locale. The method for
generating this data can be found at
http://download.gro.clinux.org/fedora/locale-pinyin-0.1.tar.gz

The pinyin_table.txt inside that package is from scim (a GPLed IME).
The gen_pinyin.pl script in that package was written by
[hidden email] and is also GPLed.



[Bug localedata/4024] collation in pinyin for zh_CN locale

glaubitz at physik dot fu-berlin.de

------- Additional Comments From drepper at redhat dot com  2007-02-17 07:35 -------
So, what exactly is the proposal?  Create a new locale zh_CN@pinyin or use the
new collation data for zh_CN?  The former sounds much safer to me.

--
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |WAITING



[Bug localedata/4024] collation in pinyin for zh_CN locale

glaubitz at physik dot fu-berlin.de

------- Additional Comments From fundawang at gmail dot com  2007-02-17 07:44 -------
> So, what exactly is the proposal?  Create a new locale zh_CN@pinyin
> or using the new collation data for zh_CN?  The former sounds much
> safer to me.
There are several collations for Chinese, such as pronunciation (pinyin) and
strokes. The most widely used collation is actually pinyin. The iso14651
collation is of no use for Chinese.

So, the proposal is to replace the current collation for zh_CN (iso14651) with
pinyin. As for strokes, we'll likely propose zh_CN@strokes in the future.
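
For readers who want to see what a locale's collation does in practice, here
is a minimal Python sketch. It sorts strings through the C library's
LC_COLLATE rules via locale.strxfrm. The helper name sort_with_locale is made
up for illustration, and "zh_CN.UTF-8" is only an assumed locale name; whether
it is installed, and which order it produces, depends on the system and the
glibc version.

```python
import locale


def sort_with_locale(words, locale_name):
    """Sort words using the LC_COLLATE rules of locale_name.

    Falls back to plain code-point order if the locale is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # Locale not available on this system: use code-point order instead.
        return sorted(words)
    try:
        # strxfrm maps each string to a key that compares like strcoll().
        return sorted(words, key=locale.strxfrm)
    finally:
        # Restore the collation setting that was active before the call.
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    # Assumed locale name; the resulting order depends on the installed data.
    print(sort_with_locale(["津", "巾", "八"], "zh_CN.UTF-8"))
```

Running the same call against different locale definitions is one simple way
to compare the iso14651 order with a pinyin order.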


[Bug localedata/4024] collation in pinyin for zh_CN locale

glaubitz at physik dot fu-berlin.de

------- Additional Comments From ed dot trager at gmail dot com  2007-02-17 15:24 -------
Subject: Re:  collation in pinyin for zh_CN locale

Pinyin collation for zh_CN as the default would be great -- in fact, I
am surprised to learn it isn't done that way already!

I do have a question though:  What is the order of characters nested
within a given pinyin+tone category?  For example, is this going to
follow the standard order of one of the big dictionaries?  Or
something else?

My copy of "The Pinyin Chinese-English Dictionary 汉英词典" (Wu Jingrong
ed., Beijing Foreign Languages Institute 1979) seems to order
characters by number of strokes within each pinyin category, e.g. "jin1":
巾今斤金津矜 ... etc.  This is one logical way to do it.

But my copy of 现代汉语词典 (中国社会科学院语言研究所 Commercial Press Beijing 1986)
orders "jin1" completely differently: 津禁襟巾今衿矜 ... etc.  I'm not sure
what the logic here is ...

Another logical way to do it would be to order by how frequently the
character is used.  If I remember correctly from an earlier post, the
perl script for generating the locale was pulling data from SCIM
tables.  So does this mean you are going to order based on character
usage frequency within each pinyin+tone category?

Best - Ed


[Bug localedata/4024] collation in pinyin for zh_CN locale

glaubitz at physik dot fu-berlin.de

------- Additional Comments From fundawang at gmail dot com  2007-02-17 16:46 -------
> So does this mean you are going to order based on character
> usage frequency within pinyin+tone category?
Correct.

In fact, most Chinese users only care that characters with the same
pronunciation are sorted together; the inner order within a pinyin+tone
category is not that important.
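
The ordering described here (pinyin syllable first, then tone, then usage
frequency within the category) can be sketched as a three-part sort key. This
is only an illustration and not the gen_pinyin.pl logic: the table below is a
hypothetical toy, the readings are the usual ones for these characters, and
the frequency ranks are made-up placeholders rather than SCIM data.

```python
# Hypothetical toy table: character -> (pinyin syllable, tone, frequency rank).
# Lower rank means "more frequent". The ranks are placeholders, NOT real data.
TOY_TABLE = {
    "今": ("jin", 1, 1),
    "金": ("jin", 1, 2),
    "津": ("jin", 1, 3),
    "巾": ("jin", 1, 4),
    "近": ("jin", 4, 1),
    "八": ("ba", 1, 1),
}


def pinyin_key(ch):
    """Return the (syllable, tone, frequency rank) sort key for one character."""
    syllable, tone, rank = TOY_TABLE[ch]
    return (syllable, tone, rank)


def sort_by_pinyin(chars):
    """Sort characters by syllable, then tone, then frequency rank."""
    return sorted(chars, key=pinyin_key)


if __name__ == "__main__":
    print("".join(sort_by_pinyin("近巾金八津今")))
```

Swapping the third element of the key for a stroke count would give the
stroke-within-pinyin order mentioned for the 1979 dictionary.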


[Bug localedata/4024] collation in pinyin for zh_CN locale

glaubitz at physik dot fu-berlin.de

------- Additional Comments From fundawang at gmail dot com  2007-03-23 06:11 -------
How is this bug going?


[Bug localedata/4024] collation in pinyin for zh_CN locale

glaubitz at physik dot fu-berlin.de

------- Additional Comments From drepper at redhat dot com  2007-04-28 07:52 -------
You didn't change the state of the bug.  So don't complain if nobody notices it.

I have checked in a different patch.  Your collation file duplicates a lot of
information from the normal iso14651_t1 file.  That's no good.  The setup I have
now allows sharing the data.

--
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|                            |FIXED

