Unicode update of width and other character properties

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Unicode update of width and other character properties

Thomas Wolff
Hi,
this is a proposal to update wcwidth and the character properties
functions isw*/towupper/towlower to Unicode 10.0, as discussed in the
mail thread https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
as well as to simplify automatic generation of respective tables for an
easier update step.
Table size is moderate (using ranges for character properties) but there
is still an option to reduce the two big tables in size.

The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .

The Makefile.widthdata does not yet distinguish the two subdirectories
(libc/string, libc/ctypw) as it comes from a common development directory.

There is a test program in which comparison for isw*/tow* functions
between current and patched implementation can be compared.

I also provide a log of deviations of the new approach to the current
implementation, based on Unicode 5.2 data, to compare and check.
If there are any disputable cases, I would consider that of course.

My main aim was actually to get the wcwidth data updated, for which the
change is more obviously clear.

Thanks
Thomas




Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Corinna Vinschen
On Aug  6 07:36, Thomas Wolff wrote:
> Hi,
> this is a proposal to update wcwidth and the character properties functions
> isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
> https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
> as well as to simplify automatic generation of respective tables for an
> easier update step.
> Table size is moderate (using ranges for character properties) but there is
> still an option to reduce the two big tables in size.

As per the aforementioned discussion the table sizes are at least
twice as big, so this should be done with all due caution towards
the goals of smaller targets.

> The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .

That's not how it works.  Please create a git patch series and post
it here.

There's probably also a bit more to discuss before changing how this
works since it affects all targets using wide char functions.


Thanks,
Corinna

--
Corinna Vinschen
Cygwin Maintainer
Red Hat

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Thomas Wolff
Am 07.08.2017 um 12:30 schrieb Corinna Vinschen:

> On Aug  6 07:36, Thomas Wolff wrote:
>> Hi,
>> this is a proposal to update wcwidth and the character properties functions
>> isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
>> https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
>> as well as to simplify automatic generation of respective tables for an
>> easier update step.
>> Table size is moderate (using ranges for character properties) but there is
>> still an option to reduce the two big tables in size.
> As per the aforementioned discussion the table sizes are at least
> twice as big, so this should be done with all due caution towards
> the goals of smaller targets.
If I'm going to implement the packed versions, they will be even smaller
than the current tables.

>> The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .
> That's not how it works.  Please create a git patch series and post it here.
Any howto available, please? What's the git URL, how to produce the
desired patch format/series.
And then the patch would be included here by email?

Thomas
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Corinna Vinschen
On Aug  7 21:18, Thomas Wolff wrote:

> Am 07.08.2017 um 12:30 schrieb Corinna Vinschen:
> > On Aug  6 07:36, Thomas Wolff wrote:
> > > Hi,
> > > this is a proposal to update wcwidth and the character properties functions
> > > isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
> > > https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
> > > as well as to simplify automatic generation of respective tables for an
> > > easier update step.
> > > Table size is moderate (using ranges for character properties) but there is
> > > still an option to reduce the two big tables in size.
> > As per the aforementioned discussion the table sizes are at least
> > twice as big, so this should be done with all due caution towards
> > the goals of smaller targets.
> If I'm going to implement the packed versions, they will be even smaller
> than the current tables.
>
> > > The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .
> > That's not how it works.  Please create a git patch series and post it here.
> Any howto available, please? What's the git URL,
  https://cygwin.com/git.html

> how to produce the desired patch format/series.

Just as with any other git-based project:

  $ git co -b my-stuff
  [hack, hack, hack]
  $ git commit [in useful chunks]
  $ git format-patch -X (X == number of commits)

> And then the patch would be included here by email?

Yes:

$ git send-email --to="[hidden email]"


Thanks,
Corinna

--
Corinna Vinschen
Cygwin Maintainer
Red Hat

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Thomas Wolff
Am 08.08.2017 um 10:30 schrieb Corinna Vinschen:

> On Aug  7 21:18, Thomas Wolff wrote:
>> Am 07.08.2017 um 12:30 schrieb Corinna Vinschen:
>>> On Aug  6 07:36, Thomas Wolff wrote:
>>>> Hi,
>>>> this is a proposal to update wcwidth and the character properties functions
>>>> isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
>>>> https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
>>>> as well as to simplify automatic generation of respective tables for an
>>>> easier update step.
>>>> Table size is moderate (using ranges for character properties) but there is
>>>> still an option to reduce the two big tables in size.
>>> As per the aforementioned discussion the table sizes are at least
>>> twice as big, so this should be done with all due caution towards
>>> the goals of smaller targets.
>> If I'm going to implement the packed versions, they will be even smaller
>> than the current tables.
>>
>> ...
>> how to produce the desired patch format/series.
> Just as with any other git-based project:
>
>    $ git co -b my-stuff
>    [hack, hack, hack]
>    $ git commit [in useful chunks]
>    $ git format-patch -X (X == number of commits)
>
>> And then the patch would be included here by email?
> Yes:
>
> $ git send-email --to="[hidden email]"
I'm attaching my patches here for assessment.
I have revised table handling further, using gcc bit struct packing. The
two big tables have a total size of 14340 bytes now, for Unicode 10.0.
I have fixed locale handling in the isw* and tow* functions, but I've
not yet changed JP conversion. Unfortunately, the routines from
newlib/iconvdata are not as straight-forward to be employed as I
thought, because the work on multi-byte representations.
Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
handled in iconvdata (JIS-201/208/212) is a little bit obscure.
Likewise obscure is the relation between newlib/iconvdata and
newlib/libc/iconv.
To be on the safe side, I’m leaving the actual jp2uc conversion
untouched for now, and I’ve just added a dummy back-conversion uc2jp
with a #warning. If the #warning is ignored or removed, the non-Cygwin
build should work as before, fixing just locale handling.

I'm attaching the wcwidth part here, all patches are available at
http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
mailbox size limit).
Thomas


0001-creation-of-width-data-supporting-Unicode-updates.patch (36K) Download Attachment
0002-generated-width-data-included-in-repository-because-.patch (29K) Download Attachment
0003-use-generated-width-data.patch (13K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Ping: Unicode update of width and other character properties

Thomas Wolff
In reply to this post by Thomas Wolff
Hi,
this is to remind of may patch for wcwidth Unicode consistence, as
requested.
Thomas


-------- Weitergeleitete Nachricht --------
Betreff: Unicode update of width and other character properties
Datum: Sun, 6 Aug 2017 07:36:10 +0200
Von: Thomas Wolff <[hidden email]>
An: [hidden email]



Hi,
this is a proposal to update wcwidth and the character properties
functions isw*/towupper/towlower to Unicode 10.0, as discussed in the
mail thread https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
as well as to simplify automatic generation of respective tables for an
easier update step.
Table size is moderate (using ranges for character properties) but there
is still an option to reduce the two big tables in size.

The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .

The Makefile.widthdata does not yet distinguish the two subdirectories
(libc/string, libc/ctypw) as it comes from a common development directory.

There is a test program in which comparison for isw*/tow* functions
between current and patched implementation can be compared.

I also provide a log of deviations of the new approach to the current
implementation, based on Unicode 5.2 data, to compare and check.
If there are any disputable cases, I would consider that of course.

My main aim was actually to get the wcwidth data updated, for which the
change is more obviously clear.

Thanks
Thomas

Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Corinna Vinschen
In reply to this post by Thomas Wolff
Sorry for the late reply, I forgot this patch.

On Aug 17 07:53, Thomas Wolff wrote:

> [...]
> I'm attaching my patches here for assessment.
> I have revised table handling further, using gcc bit struct packing. The two
> big tables have a total size of 14340 bytes now, for Unicode 10.0.
> I have fixed locale handling in the isw* and tow* functions, but I've not
> yet changed JP conversion. Unfortunately, the routines from newlib/iconvdata
> are not as straight-forward to be employed as I thought, because the work on
> multi-byte representations.
> Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
> handled in iconvdata (JIS-201/208/212) is a little bit obscure.
> Likewise obscure is the relation between newlib/iconvdata and
> newlib/libc/iconv.
This is really old stuff.  I wonder if anybody is still using it with
Unicode around for a long time...

> To be on the safe side, I’m leaving the actual jp2uc conversion untouched
> for now, and I’ve just added a dummy back-conversion uc2jp with a #warning.
> If the #warning is ignored or removed, the non-Cygwin build should work as
> before, fixing just locale handling.
>
> I'm attaching the wcwidth part here, all patches are available at
> http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
> mailbox size limit).

So why don't you use git send-email (ideally with a cover letter, see
`git format-patch --cover-letter') instead of attaching the patches to a
single email?  This is the correct way of sending patch series and it
gets you around the size limit.

The below patches are missing a patch, last one is patch 3/4.

Patches 2 and 3 are ok, afaics, but as for patch 1, why did you create
an extra Makefile?  This should be merged into string/Makefile.am.


Corinna

--
Corinna Vinschen
Cygwin Maintainer
Red Hat

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Thomas Wolff

> Sorry for the late reply, I forgot this patch.
>
> On Aug 17 07:53, Thomas Wolff wrote:
>> [...]
>> I'm attaching my patches here for assessment.
>> I have revised table handling further, using gcc bit struct packing. The two
>> big tables have a total size of 14340 bytes now, for Unicode 10.0.
>> I have fixed locale handling in the isw* and tow* functions, but I've not
>> yet changed JP conversion. Unfortunately, the routines from newlib/iconvdata
>> are not as straight-forward to be employed as I thought, because the work on
>> multi-byte representations.
>> Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
>> handled in iconvdata (JIS-201/208/212) is a little bit obscure.
>> Likewise obscure is the relation between newlib/iconvdata and
>> newlib/libc/iconv.
> This is really old stuff.  I wonder if anybody is still using it with
> Unicode around for a long time...
>
>> To be on the safe side, I’m leaving the actual jp2uc conversion untouched
>> for now, and I’ve just added a dummy back-conversion uc2jp with a #warning.
>> If the #warning is ignored or removed, the non-Cygwin build should work as
>> before, fixing just locale handling.
>>
>> I'm attaching the wcwidth part here, all patches are available at
>> http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
>> mailbox size limit).
> So why don't you use git send-email (ideally with a cover letter, see
> `git format-patch --cover-letter') instead of attaching the patches to a
> single email?  This is the correct way of sending patch series and it
> gets you around the size limit.
Because of:
LC_ALL=C git send-email
git: 'send-email' is not a git command. See 'git --help'.

Are there any working instructions for newlib contributions to be found
anywhere?

> The below patches are missing a patch, last one is patch 3/4.
>
> Patches 2 and 3 are ok, afaics, but as for patch 1, why did you create
> an extra Makefile?  This should be merged into string/Makefile.am.
I was awaiting feedback before doing further integration work. The
Makefile includes test stuff and also the table generation. Would the
generation be invoked every time or rather called manually?

Thomas

---
Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft.
https://www.avast.com/antivirus

Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Jon TURNEY
On 03/12/2017 14:07, Thomas Wolff wrote:
>> On Aug 17 07:53, Thomas Wolff wrote:

>>> I'm attaching the wcwidth part here, all patches are available at
>>> http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
>>> mailbox size limit).
>> So why don't you use git send-email (ideally with a cover letter, see
>> `git format-patch --cover-letter') instead of attaching the patches to a
>> single email?  This is the correct way of sending patch series and it
>> gets you around the size limit.
> Because of:
> LC_ALL=C git send-email
> git: 'send-email' is not a git command. See 'git --help'.
>
> Are there any working instructions for newlib contributions to be found
> anywhere?

Due to extra deps, git-send-email is usually packaged separately.

On Cygwin, the package name is 'git-email' ('Email tools for Git')

Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Brian Inglis
In reply to this post by Thomas Wolff
On 2017-12-03 07:07, Thomas Wolff wrote:
>> On Aug 17 07:53, Thomas Wolff wrote:
>> So why don't you use git send-email (ideally with a cover letter, see `git
>> format-patch --cover-letter') instead of attaching the patches to a single
>> email?  This is the correct way of sending patch series and it gets you
>> around the size limit.
> Because of:
> LC_ALL=C git send-email
> git: 'send-email' is not a git command. See 'git --help'.

You need to install git-email, and if you run X you might also want git-gui.

Ensure $GIT_EDITOR|$VISUAL|$EDITOR stays in foreground so that you can edit
commit messages, emails, interactive rebases, merges, etc. e.g.:
        git config --global core.editor 'gvim -f'

You should not need LC_ALL=C most places these days, except to get sort, join,
uniq to play well together.

> Are there any working instructions for newlib contributions to be found
> anywhere?

Everyone assumes you are comfortable with git and understand its model.
Advice I git:

cd  .../repo

git checkout master
git pull
git checkout -b BRANCH

$VISUAL FILE
git add FILE
git commit FILE
...

git format-patch -o PATH/ --stat --cover-letter -#commits
$VISUAL PATH/0000-*cover-letter.patch
git send-email --compose PATH/000?-*.patch

Update branch to the latest upstream master:

git checkout master
git pull
git rebase [-i] master BRANCH

...

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Corinna Vinschen
In reply to this post by Thomas Wolff
On Dec  3 15:07, Thomas Wolff wrote:

> > On Aug 17 07:53, Thomas Wolff wrote:
> > > [...]
> > > I have fixed locale handling in the isw* and tow* functions, but I've not
> > > yet changed JP conversion. Unfortunately, the routines from newlib/iconvdata
> > > are not as straight-forward to be employed as I thought, because the work on
> > > multi-byte representations.
> > > Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
> > > handled in iconvdata (JIS-201/208/212) is a little bit obscure.
> > > Likewise obscure is the relation between newlib/iconvdata and
> > > newlib/libc/iconv.
> > This is really old stuff.  I wonder if anybody is still using it with
> > Unicode around for a long time...
I forgot to mention, I think your approach to keep this is the best
one for now so as not to break anything for small targets.

> > > To be on the safe side, I’m leaving the actual jp2uc conversion untouched
> > > for now, and I’ve just added a dummy back-conversion uc2jp with a #warning.
> > > If the #warning is ignored or removed, the non-Cygwin build should work as
> > > before, fixing just locale handling.
> > >
> > > I'm attaching the wcwidth part here, all patches are available at
> > > http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
> > > mailbox size limit).
> > So why don't you use git send-email (ideally with a cover letter, see
> > `git format-patch --cover-letter') instead of attaching the patches to a
> > single email?  This is the correct way of sending patch series and it
> > gets you around the size limit.
> Because of:
> LC_ALL=C git send-email
> git: 'send-email' is not a git command. See 'git --help'.
>
> Are there any working instructions for newlib contributions to be found
> anywhere?
Jon and Brian answered that.

> > The below patches are missing a patch, last one is patch 3/4.
> >
> > Patches 2 and 3 are ok, afaics, but as for patch 1, why did you create
> > an extra Makefile?  This should be merged into string/Makefile.am.
> I was awaiting feedback before doing further integration work. The Makefile
> includes test stuff and also the table generation. Would the generation be
> invoked every time or rather called manually?

Keeping generated files in the repos is frowned upon these days, but
we're doing this for a pretty long time already and don't want everybody
having to do these, mostly awkward steps.  So, yeah, keeping the tables
in the repo and manually calling the generation targets sounds right to
me.  Only maintainers (or interested parties) need to do this once in a
while.


Thanks,
Corinna

--
Corinna Vinschen
Cygwin Maintainer
Red Hat

signature.asc (836 bytes) Download Attachment