Character classifications and language-dependence

Ludovic Courtès
Hi,

Currently, many locale definition files that come with glibc (actually
mostly those of western languages) include the "i18n" FDCC-set under
their `LC_CTYPE' category.

However, the "i18n" FDCC-set contains a very broad character
classification: it considers at least all Latin, Greek and Cyrillic
letters as part of the `alpha' character class (as seen in Section 4.3.2
of ISO 14652 [0] and glibc's version).  Thus, all the languages whose
locale includes "i18n" end up having a lot of letters in their `alpha'
character class, more than actually exist in the language.

For instance, while `ê' (`e' circumflex) is a letter in French, it is
not a letter in Castellano; likewise, `ñ' is a letter in Castellano, but
not in French.  But since glibc's locale definitions for `fr_FR' and
`es_ES' both include "i18n", `isalpha(3)' returns true for both letters
in both locales.
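
To make the contrast concrete, here is a rough Python sketch.  It is not
glibc code: Python's Unicode-based `str.isalpha()' merely stands in for
the broad "i18n" class, and the per-language letter tables are
hypothetical and deliberately incomplete, written only for this example.

```python
import unicodedata

def _letters(s):
    # Normalize to NFC so each accented letter is one code point.
    return set(unicodedata.normalize("NFC", s))

BASE = set("abcdefghijklmnopqrstuvwxyz")

# Hypothetical, incomplete per-language additions to a-z (an assumption
# made for this sketch, not data taken from glibc or ISO 14652).
EXTRA_LETTERS = {
    "fr": _letters("àâæçéèêëîïôœùûüÿ"),
    "es": _letters("áéíóúüñ"),
}

def is_alpha_broad(ch):
    """Language-independent test; a stand-in for the broad "i18n" class."""
    return ch.isalpha()

def is_alpha_narrow(lang, ch):
    """Language-specific test against the hypothetical table above."""
    return ch.lower() in BASE | EXTRA_LETTERS[lang]

# U+00EA is e-circumflex, U+00F1 is n-tilde.
for ch in "\u00ea\u00f1":
    print(ch, is_alpha_broad(ch),
          is_alpha_narrow("fr", ch), is_alpha_narrow("es", ch))
```

With the broad test both letters are alphabetic whatever the language;
with the narrow tables each one is alphabetic only in "its" language.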

Section 4 of ISO 14652 reads:

  This Technical Report also defines an FDCC-set named "i18n" with
  values for some of the above categories in order to simplify FDCC-set
  descriptions for a number of cultures.  The contents of "i18n"
  categories should not necessarily be considered as the most commonly
  accepted values, while in many cases it could be the recommended
  values.

Thus, my understanding is that glibc's heavy use of "i18n" for character
classifications is acceptable, though not representative of "the most
commonly accepted values".  Therefore, one could for instance refine the
`fr_FR' character classification so that only French letters (e.g., not
`ñ') are found under its `alpha' class.

Is this correct?  If so, are there plans to actually refine (some of)
these character classifications?

Thanks,
Ludovic.

[0] http://www.open-std.org/jtc1/sc22/wg20/docs/projects#14652

Re: Character classifications and language-dependence

Keld Jørn Simonsen
On Thu, Sep 14, 2006 at 06:36:31PM +0200, Ludovic Courtès wrote:

> Hi,
>
> Currently, many locale definition files that come with glibc (actually
> mostly those of western languages) include the "i18n" FDCC-set under
> their `LC_CTYPE' category.
>
> However, the "i18n" FDCC-set contains a very broad character
> classification: it considers at least all Latin, Greek and Cyrillic
> letters as part of the `alpha' character class (as seen in Section 4.3.2
> of ISO 14652 [0] and glibc's version).  Thus, all the languages whose
> locale includes "i18n" end up having a lot of letters in their `alpha'
> character class, more than actually exist in the language.
>
> For instance, while `ê' (`e' circumflex) is a letter in French, it is
> not a letter in Castellano; likewise, `ñ' is a letter in Castellano, but
> not in French.  But since glibc's locale definitions for `fr_FR' and
> `es_ES' both include "i18n", `isalpha(3)' returns true for both letters
> in both locales.
>
> Section 4 of ISO 14652 reads:
>
>   This Technical Report also defines an FDCC-set named "i18n" with
>   values for some of the above categories in order to simplify FDCC-set
>   descriptions for a number of cultures.  The contents of "i18n"
>   categories should not necessarily be considered as the most commonly
>   accepted values, while in many cases it could be the recommended
>   values.
>
> Thus, my understanding is that glibc's heavy use of "i18n" for character
> classifications is acceptable, though not representative of "the most
> commonly accepted values".  Therefore, one could for instance refine the
> `fr_FR' character classification so that only French letters (e.g., not
> `ñ') are found under its `alpha' class.
>
> Is this correct?  If so, are there plans to actually refine (some of)
> these character classifications?

The reasoning behind considering a-circumflex and the like a letter,
also in languages not normally using it, is that in general readers will
recognize it as a letter, and somewhat know how to pronounce it etc.
Thus in Denmark â is used for example in names of French wines, like
"Château de Bonfils" and this may occur regularly, e.g. in newspaper
advertisements, or on menus in restaurants. It is thus good to know
that â can be part of a word, and thus it should be in class alpha of
this locale. The same would be valid for possibly all other locales
of the world.
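
The effect on word recognition can be sketched in a few lines of Python.
This is only an illustration: the "narrow" pattern is a hypothetical
Danish-only alphabet (a-z plus æ, ø, å) invented for the example, and the
"broad" pattern relies on Python's Unicode-aware regular expressions
rather than on any glibc locale.

```python
import re

TEXT = "Ch\u00e2teau de Bonfils"  # \u00e2 is a-circumflex

# Broad class: any Unicode letter (word characters minus digits and "_").
BROAD_WORD = re.compile(r"[^\W\d_]+")

# Narrow class: hypothetical Danish-only alphabet, a-z plus æ, ø, å.
NARROW_WORD = re.compile("[a-z\u00e6\u00f8\u00e5]+", re.IGNORECASE)

print(BROAD_WORD.findall(TEXT))   # ['Château', 'de', 'Bonfils']
print(NARROW_WORD.findall(TEXT))  # ['Ch', 'teau', 'de', 'Bonfils']
```

With the narrow class the wine name is cut in two at the â, which is the
kind of surprise a broad `alpha' class avoids.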

I don't know if there is any work on some locales to change this,
but I would recommend against it. However, one could think of creating
new classes for specific purposes. What would your use be?

best regards
keld

Re: Character classifications and language-dependence

Ludovic Courtès
Hi,

Keld Jørn Simonsen <[hidden email]> writes:

> The reasoning behind considering a-circumflex and the like a letter,
> also in languages not normally using it, is that in general readers will
> recognize it as a letter, and somewhat know how to pronounce it etc.
> Thus in Denmark â is used for example in names of French wines, like
> "Château de Bonfils" and this may occur regularly, e.g. in newspaper
> advertisements, or on menus in restaurants. It is thus good to know
> that â can be part of a word, and thus it should be in class alpha of
> this locale. The same would be valid for possibly all other locales
> of the world.

This is a good point.  More generally, readers of variants of the Latin
alphabet will recognize accented Latin letters as letters.

OTOH, "i18n" also includes letters from other alphabets, like Greek and
Cyrillic, and it is unclear whether all those alphabets (and variants
thereof) can be considered "mutually recognizable" by their readers.

"Recognizability" of a letter is probably very subjective.  For
instance, accented letters found in Castellano, Italian, and French,
certainly look familiar to each other.  However, accented Latin letters
found in Central and Eastern European languages (e.g., `e' with ogonek,
as in Polish -- more generally, Latin letters not part of Latin-1)
certainly look very "unusual" to readers of French, Castellano, Italian,
etc...

> I don't know if there is any work on some locales to change this,
> but I would recommend against it. However, one could think of creating
> new classes for specific purposes. What would your use be?

Actually, I don't have any specific use case in mind.  Since the UCD
already allows the construction of a list of "all existing letters",
regardless of the language or script they "belong" to, my feeling was
that, conversely, locales could provide more language-specific
knowledge.
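
As a rough illustration of the first half of that sentence, the list can
be built with Python's `unicodedata' module, which exposes part of the
UCD.  Two assumptions are mine: "letter" is taken to mean general
category L*, and the script is guessed from the first word of the
character name, which is a crude heuristic and not the UCD Script
property.

```python
import sys
import unicodedata
from collections import Counter

def all_letters():
    """Yield every code point whose general category is a letter (L*)."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch).startswith("L"):
            yield ch

def script_guess(ch):
    """Crude heuristic: first word of the character name, e.g. 'LATIN'."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

# Count letters per guessed script; exact numbers depend on the Unicode
# version bundled with the Python interpreter.
counts = Counter(script_guess(ch) for ch in all_letters())
for script in ("LATIN", "GREEK", "CYRILLIC"):
    print(script, counts[script])
```

Nothing in this list says which letters belong to French or to
Castellano; that is the language-specific knowledge meant above.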

Initially, I was just wondering whether this broad and (to some extent)
language-independent character classification is glibc-specific, or
whether it is following some standard or recommendation.

Thanks,
Ludovic.

Re: Character classifications and language-dependence

Keld Jørn Simonsen
On Fri, Sep 15, 2006 at 09:51:52AM +0200, Ludovic Courtès wrote:

> Hi,
>
> Keld Jørn Simonsen <[hidden email]> writes:
>
> > The reasoning behind considering a-circumflex and the like a letter,
> > also in languages not normally using it, is that in general readers will
> > recognize it as a letter, and somewhat know how to pronounce it etc.
> > Thus in Denmark â is used for example in names of French wines, like
> > "Château de Bonfils" and this may occur regularly, e.g. in newspaper
> > advertisements, or on menus in restaurants. It is thus good to know
> > that â can be part of a word, and thus it should be in class alpha of
> > this locale. The same would be valid for possibly all other locales
> > of the world.
>
> This is a good point.  More generally, readers of variants of the Latin
> alphabet will recognize accented Latin letters as letters.
>
> OTOH, "i18n" also includes letters from other alphabets, like Greek and
> Cyrillic, and it is unclear whether all those alphabets (and variants
> thereof) can be considered "mutually recognizable" by their readers.
>
> "Recognizability" of a letter is probably very subjective.  For
> instance, accented letters found in Castellano, Italian, and French,
> certainly look familiar to each other.  However, accented Latin letters
> found in Central and Eastern European languages (e.g., `e' with ogonek,
> as in Polish -- more generally, Latin letters not part of Latin-1)
> certainly look very "unusual" to readers of French, Castellano, Italian,
> etc...

My first observation is that when these strange characters occur, it is
for a reason. There is an intended audience that will understand what is
written, and for those, as they would know how to read it, then it
should follow the rules for the characters and scripts in question.

My other observation is that in the EU, where both you and I live, all
citizens are required by law to be treated equally, in every member
state of the EU. That IMHO includes that every citizen has a right to
have his or her name spelled correctly. Now the EU includes countries
like Poland (with weird characters), Denmark (weird characters like æøå) and
Greece (with a lot of weird characters), and soon-to-be member Bulgaria,
which uses the Cyrillic script. So for all public institutions there is
an emerging requirement to be able to handle all these letters in all
these scripts. Making locales that are valid only for the public sector,
and then other locales for the private sector and so on, does not seem a
good way forward.

> > I don't know if there is any work on some locales to change this,
> > but I would recommend against it. However, one could think of creating
> > new classes for specific purposes. What would your use be?
>
> Actually, I don't have any specific use case in mind.  Since the UCD
> already allows the construction of a list of "all existing letters",
> regardless of the language or script they "belong" to, my feeling was
> that, conversely, locales could provide more language-specific
> knowledge.
>
> Initially, I was just wondering whether this broad and (to some extent)
> language-independent character classification is glibc-specific, or
> whether it is following some standard or recommendation.

AFAIK glibc follows ISO 14652 recommendations, which essentially is the
same as what Unicode advocates: that all the letters of the different
scripts and also the ideographs are considered belonging to class
alpha.

I think changing this would change current behaviour, often
unexpectedly. That is why I would rather have a new class for this,
with some explicit field of usage, so that programmers using this class
would know what to expect, worldwide.

best regards
keld

Re: Character classifications and language-dependence

Ludovic Courtès
Hi,

Keld Jørn Simonsen <[hidden email]> writes:

> On Fri, Sep 15, 2006 at 09:51:52AM +0200, Ludovic Courtès wrote:
>> This is a good point.  More generally, readers of variants of the Latin
>> alphabet will recognize accented Latin letters as letters.
>>
>> OTOH, "i18n" also includes letters from other alphabets, like Greek and
>> Cyrillic, and it is unclear whether all those alphabets (and variants
>> thereof) can be considered "mutually recognizable" by their readers.
>>
>> "Recognizability" of a letter is probably very subjective.  For
>> instance, accented letters found in Castellano, Italian, and French,
>> certainly look familiar to each other.  However, accented Latin letters
>> found in Central and Eastern European languages (e.g., `e' with ogonek,
>> as in Polish -- more generally, Latin letters not part of Latin-1)
>> certainly look very "unusual" to readers of French, Castellano, Italian,
>> etc...
>
> My first observation is that when these strange characters occur, it is
> for a reason. There is an intended audience that will understand what is
> written, and for those, as they would know how to read it, then it
> should follow the rules for the characters and scripts in question.
>
> My other observation is that in the EU, where both you and I live, all
> citizens are required by law to be treated equally, in every member
> state of the EU. [...]

Sorry if that wasn't clear from my previous email, but I fully agree
with you as far as respect of cultures and languages is concerned.  IMO,
that obviously is not limited to the EU.  Also, respecting languages
implies that phrases such as "these strange characters" should be
considered inappropriate.

Anyway, this is my personal opinion and this is not what I wanted to
talk about in the first place.

>> Initially, I was just wondering whether this broad and (to some extent)
>> language-independent character classification is glibc-specific, or
>> whether it is following some standard or recommendation.
>
> AFAIK glibc follows ISO 14652 recommendations, which essentially is the
> same as what Unicode advocates: that all the letters of the different
> scripts and also the ideographs are considered belonging to class
> alpha.

So perhaps the ISO 14652 paragraph about the "i18n" FDCC-set that I
quoted in my first message should be interpreted as a recommendation to
include "i18n" in all locales?  Is that what you meant?

If this is the case, the language-independent character classification
found in glibc is not glibc-specific but standard-conforming.

Thanks,
Ludovic.

Re: Character classifications and language-dependence

Keld Jørn Simonsen
On Sat, Sep 16, 2006 at 04:20:58PM +0200, Ludovic Courtès wrote:

> Hi,
>
> Keld Jørn Simonsen <[hidden email]> writes:
>
> >> Initially, I was just wondering whether this broad and (to some extent)
> >> language-independent character classification is glibc-specific, or
> >> whether it is following some standard or recommendation.
> >
> > AFAIK glibc follows ISO 14652 recommendations, which essentially is the
> > same as what Unicode advocates: that all the letters of the different
> > scripts and also the ideographs are considered belonging to class
> > alpha.
>
> So perhaps the ISO 14652 paragraph about the "i18n" FDCC-set that I
> quoted in my first message should be interpreted as a recommendation to
> include "i18n" in all locales?  Is that what you meant?

Yes, that is a recommendation.

> If this is the case, the language-independent character classification
> found in glibc is not glibc-specific but standard-conforming.

Yes, kind of. ISO 14652 is a Technical Report (TR), so there is no formal
conformance to it.

best regards
keld