Unicode update of width and other character properties

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Unicode update of width and other character properties

Thomas Wolff
Hi,
this is a proposal to update wcwidth and the character properties
functions isw*/towupper/towlower to Unicode 10.0, as discussed in the
mail thread https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
as well as to simplify automatic generation of respective tables for an
easier update step.
Table size is moderate (using ranges for character properties) but there
is still an option to reduce the two big tables in size.

The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .

The Makefile.widthdata does not yet distinguish the two subdirectories
(libc/string, libc/ctypw) as it comes from a common development directory.

There is a test program in which comparison for isw*/tow* functions
between current and patched implementation can be compared.

I also provide a log of deviations of the new approach to the current
implementation, based on Unicode 5.2 data, to compare and check.
If there are any disputable cases, I would consider that of course.

My main aim was actually to get the wcwidth data updated, for which the
change is more obviously clear.

Thanks
Thomas




Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Corinna Vinschen
On Aug  6 07:36, Thomas Wolff wrote:
> Hi,
> this is a proposal to update wcwidth and the character properties functions
> isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
> https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
> as well as to simplify automatic generation of respective tables for an
> easier update step.
> Table size is moderate (using ranges for character properties) but there is
> still an option to reduce the two big tables in size.

As per the aforementioned discussion the table sizes are at least
twice as big, so this should be done with all due caution towards
the goals of smaller targets.

> The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .

That's not how it works.  Please create a git patch series and post
it here.

There's probably also a bit more to discuss before changing how this
works since it affects all targets using wide char functions.


Thanks,
Corinna

--
Corinna Vinschen
Cygwin Maintainer
Red Hat

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Thomas Wolff
Am 07.08.2017 um 12:30 schrieb Corinna Vinschen:

> On Aug  6 07:36, Thomas Wolff wrote:
>> Hi,
>> this is a proposal to update wcwidth and the character properties functions
>> isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
>> https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
>> as well as to simplify automatic generation of respective tables for an
>> easier update step.
>> Table size is moderate (using ranges for character properties) but there is
>> still an option to reduce the two big tables in size.
> As per the aforementioned discussion the table sizes are at least
> twice as big, so this should be done with all due caution towards
> the goals of smaller targets.
If I'm going to implement the packed versions, they will be even smaller
than the current tables.

>> The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .
> That's not how it works.  Please create a git patch series and post it here.
Any howto available, please? What's the git URL, how to produce the
desired patch format/series.
And then the patch would be included here by email?

Thomas
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Corinna Vinschen
On Aug  7 21:18, Thomas Wolff wrote:

> Am 07.08.2017 um 12:30 schrieb Corinna Vinschen:
> > On Aug  6 07:36, Thomas Wolff wrote:
> > > Hi,
> > > this is a proposal to update wcwidth and the character properties functions
> > > isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
> > > https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
> > > as well as to simplify automatic generation of respective tables for an
> > > easier update step.
> > > Table size is moderate (using ranges for character properties) but there is
> > > still an option to reduce the two big tables in size.
> > As per the aforementioned discussion the table sizes are at least
> > twice as big, so this should be done with all due caution towards
> > the goals of smaller targets.
> If I'm going to implement the packed versions, they will be even smaller
> than the current tables.
>
> > > The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .
> > That's not how it works.  Please create a git patch series and post it here.
> Any howto available, please? What's the git URL,
  https://cygwin.com/git.html

> how to produce the desired patch format/series.

Just as with any other git-based project:

  $ git co -b my-stuff
  [hack, hack, hack]
  $ git commit [in useful chunks]
  $ git format-patch -X (X == number of commits)

> And then the patch would be included here by email?

Yes:

$ git send-email --to="[hidden email]"


Thanks,
Corinna

--
Corinna Vinschen
Cygwin Maintainer
Red Hat

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Thomas Wolff
Am 08.08.2017 um 10:30 schrieb Corinna Vinschen:

> On Aug  7 21:18, Thomas Wolff wrote:
>> Am 07.08.2017 um 12:30 schrieb Corinna Vinschen:
>>> On Aug  6 07:36, Thomas Wolff wrote:
>>>> Hi,
>>>> this is a proposal to update wcwidth and the character properties functions
>>>> isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
>>>> https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
>>>> as well as to simplify automatic generation of respective tables for an
>>>> easier update step.
>>>> Table size is moderate (using ranges for character properties) but there is
>>>> still an option to reduce the two big tables in size.
>>> As per the aforementioned discussion the table sizes are at least
>>> twice as big, so this should be done with all due caution towards
>>> the goals of smaller targets.
>> If I'm going to implement the packed versions, they will be even smaller
>> than the current tables.
>>
>> ...
>> how to produce the desired patch format/series.
> Just as with any other git-based project:
>
>    $ git co -b my-stuff
>    [hack, hack, hack]
>    $ git commit [in useful chunks]
>    $ git format-patch -X (X == number of commits)
>
>> And then the patch would be included here by email?
> Yes:
>
> $ git send-email --to="[hidden email]"
I'm attaching my patches here for assessment.
I have revised table handling further, using gcc bit struct packing. The
two big tables have a total size of 14340 bytes now, for Unicode 10.0.
I have fixed locale handling in the isw* and tow* functions, but I've
not yet changed JP conversion. Unfortunately, the routines from
newlib/iconvdata are not as straight-forward to be employed as I
thought, because the work on multi-byte representations.
Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
handled in iconvdata (JIS-201/208/212) is a little bit obscure.
Likewise obscure is the relation between newlib/iconvdata and
newlib/libc/iconv.
To be on the safe side, I’m leaving the actual jp2uc conversion
untouched for now, and I’ve just added a dummy back-conversion uc2jp
with a #warning. If the #warning is ignored or removed, the non-Cygwin
build should work as before, fixing just locale handling.

I'm attaching the wcwidth part here, all patches are available at
http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
mailbox size limit).
Thomas


0001-creation-of-width-data-supporting-Unicode-updates.patch (36K) Download Attachment
0002-generated-width-data-included-in-repository-because-.patch (29K) Download Attachment
0003-use-generated-width-data.patch (13K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Ping: Unicode update of width and other character properties

Thomas Wolff
In reply to this post by Thomas Wolff
Hi,
this is to remind of may patch for wcwidth Unicode consistence, as
requested.
Thomas


-------- Weitergeleitete Nachricht --------
Betreff: Unicode update of width and other character properties
Datum: Sun, 6 Aug 2017 07:36:10 +0200
Von: Thomas Wolff <[hidden email]>
An: [hidden email]



Hi,
this is a proposal to update wcwidth and the character properties
functions isw*/towupper/towlower to Unicode 10.0, as discussed in the
mail thread https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
as well as to simplify automatic generation of respective tables for an
easier update step.
Table size is moderate (using ranges for character properties) but there
is still an option to reduce the two big tables in size.

The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .

The Makefile.widthdata does not yet distinguish the two subdirectories
(libc/string, libc/ctypw) as it comes from a common development directory.

There is a test program in which comparison for isw*/tow* functions
between current and patched implementation can be compared.

I also provide a log of deviations of the new approach to the current
implementation, based on Unicode 5.2 data, to compare and check.
If there are any disputable cases, I would consider that of course.

My main aim was actually to get the wcwidth data updated, for which the
change is more obviously clear.

Thanks
Thomas

Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Corinna Vinschen
In reply to this post by Thomas Wolff
Sorry for the late reply, I forgot this patch.

On Aug 17 07:53, Thomas Wolff wrote:

> [...]
> I'm attaching my patches here for assessment.
> I have revised table handling further, using gcc bit struct packing. The two
> big tables have a total size of 14340 bytes now, for Unicode 10.0.
> I have fixed locale handling in the isw* and tow* functions, but I've not
> yet changed JP conversion. Unfortunately, the routines from newlib/iconvdata
> are not as straight-forward to be employed as I thought, because the work on
> multi-byte representations.
> Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
> handled in iconvdata (JIS-201/208/212) is a little bit obscure.
> Likewise obscure is the relation between newlib/iconvdata and
> newlib/libc/iconv.
This is really old stuff.  I wonder if anybody is still using it with
Unicode around for a long time...

> To be on the safe side, I’m leaving the actual jp2uc conversion untouched
> for now, and I’ve just added a dummy back-conversion uc2jp with a #warning.
> If the #warning is ignored or removed, the non-Cygwin build should work as
> before, fixing just locale handling.
>
> I'm attaching the wcwidth part here, all patches are available at
> http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
> mailbox size limit).

So why don't you use git send-email (ideally with a cover letter, see
`git format-patch --cover-letter') instead of attaching the patches to a
single email?  This is the correct way of sending patch series and it
gets you around the size limit.

The below patches are missing a patch, last one is patch 3/4.

Patches 2 and 3 are ok, afaics, but as for patch 1, why did you create
an extra Makefile?  This should be merged into string/Makefile.am.


Corinna

--
Corinna Vinschen
Cygwin Maintainer
Red Hat

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Thomas Wolff

> Sorry for the late reply, I forgot this patch.
>
> On Aug 17 07:53, Thomas Wolff wrote:
>> [...]
>> I'm attaching my patches here for assessment.
>> I have revised table handling further, using gcc bit struct packing. The two
>> big tables have a total size of 14340 bytes now, for Unicode 10.0.
>> I have fixed locale handling in the isw* and tow* functions, but I've not
>> yet changed JP conversion. Unfortunately, the routines from newlib/iconvdata
>> are not as straight-forward to be employed as I thought, because the work on
>> multi-byte representations.
>> Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
>> handled in iconvdata (JIS-201/208/212) is a little bit obscure.
>> Likewise obscure is the relation between newlib/iconvdata and
>> newlib/libc/iconv.
> This is really old stuff.  I wonder if anybody is still using it with
> Unicode around for a long time...
>
>> To be on the safe side, I’m leaving the actual jp2uc conversion untouched
>> for now, and I’ve just added a dummy back-conversion uc2jp with a #warning.
>> If the #warning is ignored or removed, the non-Cygwin build should work as
>> before, fixing just locale handling.
>>
>> I'm attaching the wcwidth part here, all patches are available at
>> http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
>> mailbox size limit).
> So why don't you use git send-email (ideally with a cover letter, see
> `git format-patch --cover-letter') instead of attaching the patches to a
> single email?  This is the correct way of sending patch series and it
> gets you around the size limit.
Because of:
LC_ALL=C git send-email
git: 'send-email' is not a git command. See 'git --help'.

Are there any working instructions for newlib contributions to be found
anywhere?

> The below patches are missing a patch, last one is patch 3/4.
>
> Patches 2 and 3 are ok, afaics, but as for patch 1, why did you create
> an extra Makefile?  This should be merged into string/Makefile.am.
I was awaiting feedback before doing further integration work. The
Makefile includes test stuff and also the table generation. Would the
generation be invoked every time or rather called manually?

Thomas

---
Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft.
https://www.avast.com/antivirus

Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Jon TURNEY
On 03/12/2017 14:07, Thomas Wolff wrote:
>> On Aug 17 07:53, Thomas Wolff wrote:

>>> I'm attaching the wcwidth part here, all patches are available at
>>> http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
>>> mailbox size limit).
>> So why don't you use git send-email (ideally with a cover letter, see
>> `git format-patch --cover-letter') instead of attaching the patches to a
>> single email?  This is the correct way of sending patch series and it
>> gets you around the size limit.
> Because of:
> LC_ALL=C git send-email
> git: 'send-email' is not a git command. See 'git --help'.
>
> Are there any working instructions for newlib contributions to be found
> anywhere?

Due to extra deps, git-send-email is usually packaged separately.

On Cygwin, the package name is 'git-email' ('Email tools for Git')

Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Brian Inglis
In reply to this post by Thomas Wolff
On 2017-12-03 07:07, Thomas Wolff wrote:
>> On Aug 17 07:53, Thomas Wolff wrote:
>> So why don't you use git send-email (ideally with a cover letter, see `git
>> format-patch --cover-letter') instead of attaching the patches to a single
>> email?  This is the correct way of sending patch series and it gets you
>> around the size limit.
> Because of:
> LC_ALL=C git send-email
> git: 'send-email' is not a git command. See 'git --help'.

You need to install git-email, and if you run X you might also want git-gui.

Ensure $GIT_EDITOR|$VISUAL|$EDITOR stays in foreground so that you can edit
commit messages, emails, interactive rebases, merges, etc. e.g.:
        git config --global core.editor 'gvim -f'

You should not need LC_ALL=C most places these days, except to get sort, join,
uniq to play well together.

> Are there any working instructions for newlib contributions to be found
> anywhere?

Everyone assumes you are comfortable with git and understand its model.
Advice I git:

cd  .../repo

git checkout master
git pull
git checkout -b BRANCH

$VISUAL FILE
git add FILE
git commit FILE
...

git format-patch -o PATH/ --stat --cover-letter -#commits
$VISUAL PATH/0000-*cover-letter.patch
git send-email --compose PATH/000?-*.patch

Update branch to the latest upstream master:

git checkout master
git pull
git rebase [-i] master BRANCH

...

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Corinna Vinschen
In reply to this post by Thomas Wolff
On Dec  3 15:07, Thomas Wolff wrote:

> > On Aug 17 07:53, Thomas Wolff wrote:
> > > [...]
> > > I have fixed locale handling in the isw* and tow* functions, but I've not
> > > yet changed JP conversion. Unfortunately, the routines from newlib/iconvdata
> > > are not as straight-forward to be employed as I thought, because the work on
> > > multi-byte representations.
> > > Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
> > > handled in iconvdata (JIS-201/208/212) is a little bit obscure.
> > > Likewise obscure is the relation between newlib/iconvdata and
> > > newlib/libc/iconv.
> > This is really old stuff.  I wonder if anybody is still using it with
> > Unicode around for a long time...
I forgot to mention, I think your approach to keep this is the best
one for now so as not to break anything for small targets.

> > > To be on the safe side, I’m leaving the actual jp2uc conversion untouched
> > > for now, and I’ve just added a dummy back-conversion uc2jp with a #warning.
> > > If the #warning is ignored or removed, the non-Cygwin build should work as
> > > before, fixing just locale handling.
> > >
> > > I'm attaching the wcwidth part here, all patches are available at
> > > http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
> > > mailbox size limit).
> > So why don't you use git send-email (ideally with a cover letter, see
> > `git format-patch --cover-letter') instead of attaching the patches to a
> > single email?  This is the correct way of sending patch series and it
> > gets you around the size limit.
> Because of:
> LC_ALL=C git send-email
> git: 'send-email' is not a git command. See 'git --help'.
>
> Are there any working instructions for newlib contributions to be found
> anywhere?
Jon and Brian answered that.

> > The below patches are missing a patch, last one is patch 3/4.
> >
> > Patches 2 and 3 are ok, afaics, but as for patch 1, why did you create
> > an extra Makefile?  This should be merged into string/Makefile.am.
> I was awaiting feedback before doing further integration work. The Makefile
> includes test stuff and also the table generation. Would the generation be
> invoked every time or rather called manually?

Keeping generated files in the repos is frowned upon these days, but
we're doing this for a pretty long time already and don't want everybody
having to do these, mostly awkward steps.  So, yeah, keeping the tables
in the repo and manually calling the generation targets sounds right to
me.  Only maintainers (or interested parties) need to do this once in a
while.


Thanks,
Corinna

--
Corinna Vinschen
Cygwin Maintainer
Red Hat

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Thomas Wolff
I have finally revamped, manually rebased, and repackaged my Unicode
data patches which I'll send in separate mail.
However, as I don't have a command-line sendmail set up (and apparently
it's not as easy as it used to be),
I'll send zip archives which contain git-patch files.
There are two patches:
libc/string: wcwidth using generated width data, with data generated
from Unicode 10.0
libc/ctype: isw* and tow* functions using generated case conversion and
character class data, with Unicode 10.0 data
For both, generation script and a Makefile.widthdata / Makefile.chardata
is included. As these are to be used in the source directory,
not the binary target directory, in case of future Unicode update, they
are not related to the other Makefiles.
In ctype/, there is one new source (categories.c) which should be
compiled separately but although I tried to include it in Makefile.am,
I could not get the build process to compile it. So the current solution
is to include it from one of the other sources (the one that also
maintains the case conversion table).
Thomas
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Corinna Vinschen
On Feb 25 18:14, Thomas Wolff wrote:
> I have finally revamped, manually rebased, and repackaged my Unicode data
> patches which I'll send in separate mail.
> However, as I don't have a command-line sendmail set up (and apparently it's
> not as easy as it used to be),
> I'll send zip archives which contain git-patch files.

No, sorry, but no.  It's not that tricky to send standard git patch
series, we're all doing this.  If your MUA doesn't fit, use another MUA
or *attach* the patches, one per mail.

> There are two patches:
> libc/string: wcwidth using generated width data, with data generated from
> Unicode 10.0
> libc/ctype: isw* and tow* functions using generated case conversion and
> character class data, with Unicode 10.0 data
> For both, generation script and a Makefile.widthdata / Makefile.chardata is
> included. As these are to be used in the source directory,
> not the binary target directory, in case of future Unicode update, they are
> not related to the other Makefiles.

Eh, what?  If you read back, I had no problems with your patches 2 and
3, only with patch 1 adding new makefiles.  So the only thing I actually
asked for was to integrate the creation of the generated tables into
Makefile.am and now you're telling me this is not what you changed...?

> In ctype/, there is one new source (categories.c) which should be compiled
> separately but although I tried to include it in Makefile.am,
> I could not get the build process to compile it. So the current solution is
> to include it from one of the other sources (the one that also maintains the
> case conversion table).

That's a workaround, not a solution.  When you change Makefile.am you
have to regenerate Makefile.in, obviously.

However, since regenerating Makefile.in for newlib is (unfortunately,
for historical reasons) non-obvious, you can just go ahead and manually
add categories.* to Makefile.in where it belongs, kind of like the
attached.  A later regeneration run by one of the maintainers will fix
the formatting so that's nothing to worry about.


Corinna

--
Corinna Vinschen
Cygwin Maintainer
Red Hat

x (2K) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Thomas Wolff
Am 26.02.2018 um 18:20 schrieb Corinna Vinschen:
> On Feb 25 18:14, Thomas Wolff wrote:
>> I have finally revamped, manually rebased, and repackaged my Unicode data
>> patches which I'll send in separate mail.
>> ...
> ...
> or *attach* the patches, one per mail.
That will do, thanks.

>> There are two patches:
>> libc/string: wcwidth using generated width data, with data generated from
>> Unicode 10.0
>> libc/ctype: isw* and tow* functions using generated case conversion and
>> character class data, with Unicode 10.0 data
>> For both, generation script and a Makefile.widthdata / Makefile.chardata is
>> included. As these are to be used in the source directory,
>> not the binary target directory, in case of future Unicode update, they are
>> not related to the other Makefiles.
> Eh, what?  If you read back, I had no problems with your patches 2 and
> 3, only with patch 1 adding new makefiles.  So the only thing I actually
> asked for was to integrate the creation of the generated tables into
> Makefile.am and now you're telling me this is not what you changed...?
First I added an include to the generation makefile into Makefile.am,
then it occurred to me that the generated makefile resides in the target
hierarchy while the generation should probably be invoked in the source
directory, so I removed it again.
I'm not sure about the best or preferred invocation interface for such a
step. Maybe I should just provide the generation scripts (mk*) and leave
it up to you to integrate them into the Makefile.am.

>> In ctype/, there is one new source (categories.c) which should be compiled
>> separately but although I tried to include it in Makefile.am,
>> I could not get the build process to compile it. So the current solution is
>> to include it from one of the other sources (the one that also maintains the
>> case conversion table).
> That's a workaround, not a solution.  When you change Makefile.am you
> have to regenerate Makefile.in, obviously.
>
> However, since regenerating Makefile.in for newlib is (unfortunately,
> for historical reasons) non-obvious, you can just go ahead and manually
> add categories.* to Makefile.in where it belongs, kind of like the
> attached.
Thanks for the patch.

I'll resubmit the wcwidth patch soon, maybe you can tell me how you'd
like the data generation to be invoked, or I can submit it just with the
script. I'll submit the ctype patch later; all works fine on my Windows
10 system but there is some obscure trouble on a Windows 7 system which
I'd like to check out first.
And there's also a locale patch which I presented a separate mail.

Thomas
Reply | Threaded
Open this post in threaded view
|

Re: Unicode update of width and other character properties

Hans-Bernhard Bröker
Am 26.02.2018 um 21:02 schrieb Thomas Wolff:

> First I added an include to the generation makefile into Makefile.am,
> then it occurred to me that the generated makefile resides in the target
> hierarchy while the generation should probably be invoked in the source
> directory, so I removed it again.

Which directory the tool is to be invoked in is irrelevant for this.
Anything you put into a Makefile.am will automatically go into
Makefile.in and Makefile, anyway.  Well, assuming you actually run the
autotools, that is.

If you need to run things from a inside the src tree, you're supposed to
apply something like

        cd $(srcdir) &&
        cd $(top_srcdir)/some/where &&

to them just like, e.g. the auto-generated production rules for
Makefile.in and configure do it.