Locale/charset combinations

Leonard den Ottolander-4
Hi,

Why are so few locale/character set combinations currently supported?
There must be hundreds if not thousands of valid combinations for which
no entry exists in localedata/SUPPORTED. For certain character sets that
are defined there are no valid locale/charset combinations at all. This
makes it impossible to use tools like sed and grep with the [:alpha:]
character class on files encoded with such character sets (unless I am
overlooking a way to override the character set used for a given
locale).

Of course it is possible to first convert such files using iconv, but it
would be much more convenient to be able to use grep and sed on the raw
data.
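
A minimal sketch of that workaround, assuming the data really is CP1252
and that a UTF-8 locale is installed. The function names and the C.UTF-8
locale name are assumptions made for illustration; substitute whatever
UTF-8 locale "locale -a" lists on your system:

```shell
# Hypothetical helper: convert CP1252 text on stdin to UTF-8 on stdout.
cp1252_to_utf8() {
    iconv -f CP1252 -t UTF-8
}

# Hypothetical helper: print the lines of a CP1252 file ($2) that match
# the pattern $1. The match runs on the converted text in a UTF-8 locale
# (C.UTF-8 is an assumption), so classes like [[:alpha:]] can cover
# accented letters. The output is UTF-8, not CP1252.
grep_cp1252() {
    cp1252_to_utf8 < "$2" | LC_ALL=C.UTF-8 grep -- "$1"
}
```

Usage would look like this, where dump.sql is a placeholder file name:

$ grep_cp1252 '[[:alpha:]]' dump.sql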

Because the definition of the character codes in the character maps
doesn't have any impact on the other locale settings, the introduction
of many more locale/charset combinations shouldn't cause any breakage
afaict (even for incorrectly chosen character sets).

To me the lack of a locale supporting the character set is most notable
for CP1252 (aka "MS-ANSI"). Much of the data I have to work with uses
this encoding, mostly in the form of MySQL table dumps using the latin1
charset - yes, MySQL's latin1 is CP1252, not ISO-8859-1, see
http://dev.mysql.com/doc/refman/5.0/en/charset-mysql.html.

Since the CP1252 code page and ISO-8859-1 and ISO-8859-15 are very
similar I suppose any locale currently supporting ISO-8859-1[5] should
be able to use CP1252.

It thus shouldn't be very intrusive to add the output of

$ for locale in $(grep "ISO-8859-1\(5\)\?\ " localedata/SUPPORTED \
      | cut -f 1 -d / | cut -f 1 -d \. | sort -u); do
      echo "${locale}.CP1252/CP1252 \\"
  done

to localedata/SUPPORTED.

The same is of course true for, e.g., the IBM and EBCDIC code pages and
their corresponding locales. For example:

en_US.IBM437/IBM437 \
nl_NL.IBM437/IBM437 \
nl_NL.IBM850/IBM850 \

and

de_AT.EBCDIC-AT-DE/EBCDIC-AT-DE \
de_AT.EBCDIC-AT-DE-A/EBCDIC-AT-DE-A \
fr_CA.EBCDIC-CA-FR/EBCDIC-CA-FR \

should be valid additions to localedata/SUPPORTED.

There are a couple of locale/charset combinations currently unsupported
for which a character set is mentioned in the locale file:

$ grep -i charset az_AZ bg_BG mk_MK POSIX ur_PK wal_ET
az_AZ:% Charset: ISO-8859-9E
bg_BG:% this: bg_BG.CP1251 (CP1251 is for coresponding charset), bg_BG.KOI8R,
mk_MK:% Charsets: UTF-8, ISO-8859-5, CP1251
POSIX:# Charset: ISO646:1993
ur_PK:% Charset: CP1256
wal_ET:% Charset: UTF-8

Thus the following should be valid additions to localedata/SUPPORTED:

az_AZ.ISO-8859-9/ISO-8859-9 \
bg_BG.KOI8R/KOI8R \
bg_BG.ISO-8859-5/ISO-8859-5 \
mk_MK.CP1251/CP1251 \
ur_PK.CP1256/CP1256 \
wal_ET/UTF-8 \

Note that I've left the defaults as they are, so I add
ur_PK.CP1256/CP1256 \
instead of
-ur_PK/UTF-8 \
+ur_PK/CP1256 \
+ur_PK.UTF-8/UTF-8 \

I'm not sure about these, as a POSIX locale file is present although it
has no entry in localedata/SUPPORTED:

POSIX.ISO_646.BASIC \
POSIX.UTF-8/UTF-8 \

Leonard.



Re: Locale/charset combinations

Daniel Jacobowitz-2
On Sun, Jun 18, 2006 at 04:54:01PM +0200, Leonard den Ottolander wrote:

> Hi,
>
> Why are currently so few locales/character set combinations supported?
> There must be hundreds if not thousands of valid combinations for which
> no entry exists in localedata/SUPPORTED. For certain character sets that
> are defined there are no valid locale/charset combinations. This makes it
> impossible to use tools like sed and grep using the [:alpha:] set of
> regular expressions on files encoded with such character sets (unless I
> am overlooking a way to override the used character set for a given
> locale).

What do you want that can't be achieved by setting LC_CTYPE?

--
Daniel Jacobowitz
CodeSourcery

Re: Locale/charset combinations

Leonard den Ottolander-4
Hello Daniel,

On Fri, 2006-06-23 at 18:03 -0400, Daniel Jacobowitz wrote:
> On Sun, Jun 18, 2006 at 04:54:01PM +0200, Leonard den Ottolander wrote:
> > Why are currently so few locales/character set combinations supported?

> What do you want that can't be achieved by setting LC_CTYPE?

Setting LC_CTYPE does not do much good if the necessary compiled
locale/charset combinations aren't available on the system.

In the meantime, people have drawn my attention to localedef as the way
to add locale/charset combinations to the system.
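
A rough sketch of how that might look for the CP1252 case, assuming the
locale sources and charmaps are installed where localedef can find them.
The helper name build_locale and the default directory $HOME/.locale are
made up for illustration; only the -i and -f options of localedef are
relied upon:

```shell
# Hypothetical helper: compile locale source $1 with charmap $2 into
# the private directory $3 (default: $HOME/.locale, an assumption).
# Because the output name contains a slash, localedef writes the
# compiled locale to that path instead of the system locale directory.
build_locale() {
    src=$1 charmap=$2 dest=${3:-$HOME/.locale}
    mkdir -p "$dest" &&
    localedef -i "$src" -f "$charmap" "$dest/$src.$charmap"
}
```

With that in place, something like the following should make the new
combination usable for a single command (dump.sql is a placeholder):

$ build_locale en_US CP1252
$ LOCPATH=$HOME/.locale LC_ALL=en_US.CP1252 grep '[[:alpha:]]' dump.sql

localedef may print warnings for some locale/charmap pairs, so check
that the output directory was actually created before relying on it.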

The glibc developers chose to support only a limited set of
locale/charset combinations in order to avoid bloat, even though hard
linking redundant files already saves some disk space.

Another way to avoid bloat, it seems to me, would be to scrap the whole
compiled locale tree under /usr/lib/locale and work only with the
(compressed?) locale-archive.

Leonard.