[Bug localedata/26120] New: column width of some Korean JUNGSEONG/JONGSEONG characters wrong (should be 0)

Sourceware - glibc-bugs mailing list
https://sourceware.org/bugzilla/show_bug.cgi?id=26120

            Bug ID: 26120
           Summary: column width of some Korean JUNGSEONG/JONGSEONG
                    characters wrong (should be 0)
           Product: glibc
           Version: 2.31
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: maiku.fabian at gmail dot com
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

Robert Ross <[hidden email]> writes:

> Thank you for maintaining glibc's "localedata/charmaps/UTF-8".  It is
> good that most "HANGUL JUNGSEONG" characters have zero width due to
> "<U1160>...<U11FF> 0" on line 48775 but strange that the newer "HANGUL
> JUNGSEONG" characters have width 1 since there is no
> "<UD7B0>...<UD7C6> 0".  Similarly most "HANGUL JONGSEONG" characters
> have width 0 due to line 48775 but the newer ones have width 1 since
> there is no "<UD7CB>...<UD7FB> 0".  Please correct this if it's an
> error or explain if it's not.

In https://www.unicode.org/Public/13.0.0/ucd/EastAsianWidth.txt all of these
have the East_Asian_Width value "N" (neutral).

http://www.unicode.org/reports/tr11/ says:

6.2 Combining Marks

> Combining marks have been classified and are given a property
> assignment based on their typical applicability. For example,
> combining marks typically applied to characters of class N, Na, or W
> are classified as A. Combining marks for purely non-East Asian scripts
> are marked as N, and nonspacing marks used only with wide characters
> are given a W. Even more so than for other characters, the
> East_Asian_Width property for combining marks is not the same as their
> display width.
>
> In particular, nonspacing marks do not possess actual advance
> width. Therefore, even when displaying combining marks, the
> East_Asian_Width property cannot be related to the advance width of
> these characters. However, it can be useful in determining the
> encoding length in a legacy encoding, or the choice of font for the
> range of characters including that nonspacing mark. The width of the
> glyph image of a nonspacing mark should always be chosen as the
> appropriate one for the width of the base character.

See also: https://sourceware.org/bugzilla/show_bug.cgi?id=21750#c5

> We also agree that the Hangul Jamo U+1160‥U+11FF are sort
> of "combining characters" although they are not marked as such
> in the Unicode data. But they are fragments of Hangul characters
> which combine. So it seems correct to mark them as width 0.
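
The lookup behaviour described in the report can be illustrated with a
small sketch. It assumes that width entries have the form quoted above
("<U1160>...<U11FF> 0") and that a code point without an entry falls back
to width 1, as the report states. The names parse_width_entries and
width_of are hypothetical, and the sample list is not the real charmap
file.

```python
import re

# Sample entry in the style quoted in the report; NOT the real UTF-8 charmap.
SAMPLE_WIDTH_LINES = [
    "<U1160>...<U11FF> 0",
]

# Matches "<UXXXX> W" or "<UXXXX>...<UYYYY> W".
ENTRY = re.compile(r"<U([0-9A-F]{4,8})>(?:\.\.\.<U([0-9A-F]{4,8})>)?\s+(\d+)")


def parse_width_entries(lines):
    """Turn width lines into a list of (first, last, width) tuples."""
    entries = []
    for line in lines:
        m = ENTRY.fullmatch(line.strip())
        if m is None:
            continue  # skip anything that is not a width entry
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        entries.append((first, last, int(m.group(3))))
    return entries


def width_of(code_point, entries, default=1):
    """Return the listed width, or the assumed default of 1 if unlisted."""
    for first, last, width in entries:
        if first <= code_point <= last:
            return width
    return default


entries = parse_width_entries(SAMPLE_WIDTH_LINES)
print(width_of(0x1160, entries))  # 0: covered by <U1160>...<U11FF>
print(width_of(0xD7B0, entries))  # 1: no entry, so the default applies
```

Adding entries for U+D7B0‥U+D7C6 and U+D7CB‥U+D7FB to the sample list
would make the second lookup return 0 as well.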

--
You are receiving this mail because:
You are on the CC list for the bug.

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tg at mirbsd dot de

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |egmont at gmail dot com

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rob.ross at ymail dot com

--- Comment #1 from Mike FABIAN <maiku.fabian at gmail dot com> ---
So I think it is best to set all JUNGSEONG/JONGSEONG characters to width 0.

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at sourceware dot org   |maiku.fabian at gmail dot com

--- Comment #2 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Some information from a chat with Thorsten Glaser (translated from German):

<mfabian> Is everything that has JUNGSEONG or JONGSEONG in its name such a
          combining character? [20年06月15日 21:38:17]
<MirWarm> as far as I understand it, the Korean characters are always
          choseong + j{u,o}ngseong [20年06月15日 21:54:15]
<MirWarm>  The Hangul jamo are divided into three classes: choseong (Leading
          consonants), jungseong (Vowels) and jongseong (Trailing consonants)
          which in the rest of this write-up will be referred to as L, V and T.
          [20年06月15日 21:58:54]
<MirWarm> A standard Hangul syllable is composed as (L+V+T*)
          [20年06月15日 21:58:55]
<MirWarm> ah, yes [20年06月15日 21:58:57]
<MirWarm> so the choseong are apparently not required in the Korean script,
          but in Unicode they apparently are; you then have to start with
          U+115F [20年06月15日 21:59:24]
<MirWarm> choseong is initial (C), jungseong is medial (G) and nucleus (V),
          jongseong is coda (K) [20年06月15日 22:00:15]
<MirWarm> and Korean syllable words are (C)(G)V(K) [20年06月15日 22:00:27]
<MirWarm> and in Unicode you use U+115F when C is missing
          [20年06月15日 22:00:53]
<MirWarm> 115F is 1, the others are 0 [20年06月15日 22:01:06]
<MirWarm> that fits [20年06月15日 22:01:07]
<MirWarm> back in ~5 minutes [20年06月15日 22:01:14]
*** MirWarm (~[hidden email]) has quit: Quit: using sirc version
    2.211-MirDebian-20181124-1+ssfe (RANDOM=2406) [20年06月15日 22:01:15]
*** MirWarm (~[hidden email]) has joined channel #mirbsd
    [20年06月15日 22:06:44]
<MirWarm> re [20年06月15日 22:07:05]
<MirWarm> I'll go ahead and set D7B0 .. D7FF to 0 on my side as well
          [20年06月15日 22:08:33]
<MirWarm> there, committed [20年06月15日 22:31:19]
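
The (L+V+T*) structure quoted in the chat can be sketched as a regular
expression over conjoining jamo. The code-point ranges below are an
assumption taken from the Unicode code charts (only the U+1160‥U+11FF,
U+D7B0‥U+D7C6 and U+D7CB‥U+D7FB ranges are mentioned in this bug), and
is_jamo_syllable is a hypothetical helper name.

```python
import re

# Conjoining jamo classes; ranges assumed from the Unicode code charts.
L = "\u1100-\u115F\uA960-\uA97C"   # choseong, including the filler U+115F
V = "\u1160-\u11A7\uD7B0-\uD7C6"   # jungseong
T = "\u11A8-\u11FF\uD7CB-\uD7FB"   # jongseong

# One syllable: at least one L, at least one V, then any number of T.
SYLLABLE = re.compile(f"[{L}]+[{V}]+[{T}]*")


def is_jamo_syllable(text):
    """Return True if text is exactly one L+V+T* jamo sequence."""
    return SYLLABLE.fullmatch(text) is not None


print(is_jamo_syllable("\u1112\u1161\u11AB"))  # L V T
print(is_jamo_syllable("\u115F\u1161"))        # choseong filler + V
print(is_jamo_syllable("\u1161"))              # V alone, no leading L
```

In this model only the leading L part occupies columns; every V and T
that follows must have width 0, which is why the characters in
U+D7B0‥U+D7FB need the same treatment as U+1160‥U+11FF.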

--- Comment #3 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 12623
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12623&action=edit
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-
                 CC|                            |fweimer at redhat dot com

--- Comment #4 from Florian Weimer <fweimer at redhat dot com> ---
Does gnulib need updating as well?

--- Comment #5 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Florian Weimer from comment #4)
> Does gnulib need updating as well?

I don’t know. Does gnulib have width data?

--- Comment #6 from Florian Weimer <fweimer at redhat dot com> ---
Yes, I think it's here:

http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/uniwidth/width.c;h=c760ad33183418a8f103152ff43d57fabbc3949d;hb=HEAD

--- Comment #7 from Thorsten Glaser <tg at mirbsd dot de> ---
Erk… glibc is particular about not defining widths of not-defined characters.

Besides D7FC‥D7FF (which gave me an error in the output from my own scripts),
D7C7‥D7CA are not yet assigned and so probably need to be excluded in glibc.

Should they ever be defined, we’ll need to adjust here, so it’s probably better
to iterate over the entire D7C0‥D7FF range and only change widths for defined
codepoints from the current UCD version.

--- Comment #8 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 12629
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12629&action=edit
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

Updated patch to omit the unassigned characters.

--- Comment #9 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Thorsten Glaser from comment #7)
> Erk… glibc is particular about not defining widths of not-defined characters.
>
> Besides D7FC‥D7FF (which gave me an error in the output from my own
> scripts), D7C7‥D7CA are not yet assigned and so probably need to be excluded
> in glibc.
>
> Should they ever be defined, we’ll need to adjust here, so it’s probably
> better to iterate over the entire D7C0‥D7FF range and only change widths for
> defined codepoints from the current UCD version.

Thank you for noticing that!

I was aware that glibc has a problem with defining width of unassigned
characters, therefore I used

 for key in list(range(0xD7B0, 0xD7FC)):

instead of

 for key in list(range(0xD7B0, 0xD800)):

because D7FC‥D7FF are unassigned and localedef gave me errors
when I included them. Surprisingly, localedef did not give errors for the
unassigned D7C7‥D7CA ...

I had checked the range manually and thought all characters
from D7B0 to D7FB were assigned, but apparently I missed D7C7‥D7CA.

I improved the generator script a bit to omit the unassigned characters.
If these get defined in the future, the script will add them.
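
A minimal sketch of that idea, not the actual generator script from the
patch: it uses Python's bundled unicodedata module (an assumption made
here so the example is self-contained) to skip unassigned code points,
and prints width lines in the "<UXXXX>...<UYYYY> 0" style. The helper
names assigned_runs and width_lines are hypothetical.

```python
import unicodedata


def assigned_runs(first, last):
    """Yield (start, end) runs of assigned code points in first..last."""
    start = None
    for cp in range(first, last + 1):
        # General category "Cn" means the code point is unassigned.
        assigned = unicodedata.category(chr(cp)) != "Cn"
        if assigned and start is None:
            start = cp
        elif not assigned and start is not None:
            yield (start, cp - 1)
            start = None
    if start is not None:
        yield (start, last)


def width_lines(first, last, width=0):
    """Format each run of assigned code points as one width line."""
    lines = []
    for start, end in assigned_runs(first, last):
        if start == end:
            lines.append(f"<U{start:04X}> {width}")
        else:
            lines.append(f"<U{start:04X}>...<U{end:04X}> {width}")
    return lines


for line in width_lines(0xD7B0, 0xD7FF):
    print(line)
```

The result depends on the Unicode version shipped with the Python
interpreter. As long as D7C7‥D7CA and D7FC‥D7FF remain unassigned, this
should yield one line for D7B0‥D7C6 and one for D7CB‥D7FB; newly assigned
code points would be picked up without changing the range.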

--- Comment #10 from Thorsten Glaser <tg at mirbsd dot de> ---
Looks okay (but now you can use 0xD800 in the range call). This is similar to
what I did in my script
http://www.mirbsd.org/cvs.cgi/contrib/code/Snippets/eaw2glibc which
postprocesses the width output I normally use into glibc-compatible format.
That output is produced by the script
http://www.mirbsd.org/cvs.cgi/contrib/code/Snippets/eawparse and
http://www.mirbsd.org/cvs.cgi/X11/xc/programs/xterm/wcwidth.c?rev=HEAD contains
an example of it.

The output I get (for UCD 13.0.0) is identical to yours.

--- Comment #11 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Thorsten Glaser from comment #10)
> Looks okay (but now you can use 0xD800 in the range call),

Yes, I could. But if 0xD7FE and 0xD7FF ever get assigned,
would they be characters of the same type? I would have to check
that manually anyway.

> The output I get (for UCD 13.0.0) is identical to yours.

Great!

--- Comment #12 from Thorsten Glaser <tg at mirbsd dot de> ---
According to Blocks.txt, yes. Unicode does assign characters to blocks.

--- Comment #13 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Thorsten Glaser from comment #12)
> According to Blocks.txt, yes. Unicode does assign characters to blocks.

D7B0..D7FF; Hangul Jamo Extended-B

I think you are right. I’ll change the script to end the range at the end of
that block; that seems more likely to be correct if these characters ever get
assigned.
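
As a sketch of taking the range from the block definition instead of
hard-coding it, the Blocks.txt line quoted above can be parsed like this.
The function name parse_block_line is hypothetical and the input is just
the single line from this comment, not the whole file.

```python
def parse_block_line(line):
    """Parse a Blocks.txt entry such as 'D7B0..D7FF; Hangul Jamo Extended-B'."""
    code_range, name = line.split(";", 1)
    first, last = code_range.strip().split("..")
    return int(first, 16), int(last, 16), name.strip()


first, last, name = parse_block_line("D7B0..D7FF; Hangul Jamo Extended-B")

# Block ends are inclusive, so the Python range has to stop at last + 1.
candidates = range(first, last + 1)
print(name, hex(candidates.start), hex(candidates.stop))
```

Because the block end is inclusive, last + 1 gives 0xD800 as the
exclusive stop value, which matches the range call suggested in
comment #10.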

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #12623|0                           |1
        is obsolete|                            |
  Attachment #12629|0                           |1
        is obsolete|                            |

--- Comment #14 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 12651
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12651&action=edit
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

End the range at 0xD7FF

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #12651|0                           |1
        is obsolete|                            |

--- Comment #15 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 12661
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12661&action=edit
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

Use "make install" instead of only changing the UTF-8 file.
