sg15: Re: [Tooling] [isocpp-modules] Filename requirements for the SG15 TR

From: Mathias Stearn <redbeard0531+isocpp_at_[hidden]>
Date: Fri, 8 Mar 2019 10:31:40 -0500

On Thu, Mar 7, 2019 at 7:19 PM Tom Honermann <tom_at_[hidden]> wrote:

> I think the committee currently has a UTF-8 bias that doesn't necessarily
> reflect the global C++ community. We don't have much representation from
> Japan or China where, as I understand it, Shift-JIS and GB18030 still have
> significant usage. We also have few, if any, z/OS users in the committee
> outside of IBM representatives.
>
Not to be dismissive, but z/OS developers are a tiny subset of C++
developers. Even when targeting z series hardware (we have for a few years
now), there is the option of using Linux, which seems to be a fully
supported platform. If supporting z/OS makes the experience worse or more
complicated for other users, then I think the best option for the broader
ecosystem is to leave it out of scope for the TR. That platform can offer
an equivalent mechanism that better fits its eccentricities. I want to
point out that EBCDIC seems to be the only remaining encoding that isn't an
ASCII superset (Shift-JIS replaces 2 characters in ASCII, but they don't
matter for our purposes), so to support it we would be taking on
substantial additional complexity that is only needed for that one niche
platform.

> UTF-8 dominates the web, no one questions that. But within the C++
> ecosystem, I don't think UTF-8 dominates to a similar degree, at least not
> outside of the US and Europe. I wish I had data to back that up.
>
From http://www.tomazos.com/actcd16.pdf: "We executed standard C++
translation phase 1 through 3 on the source files assuming a UTF8 encoding.
We found that 99.0% of the source files tokenized successfully. Of the
remaining 1.0% the majority of the errors were decoding problems (most
likely from ISO8859 / Latin1 encoding)"

This was a scan of all C and C++ packages in Ubuntu. While that obviously
only represents the open source, Unix-targeting subset of the C++
community, this seems to imply that for that sub-community UTF-8 (and the
ASCII subset) dominates the source content. On top of that, I would expect
file names to have even fewer non-ASCII characters than file content, since
it is common to limit non-ASCII characters to comments and strings.
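
To make the ASCII / UTF-8 / other distinction concrete, here is a rough
sketch of how one might classify a file name's bytes. This is not the
method used in the survey quoted above, and the names `Encoding` and
`classify` are made up for illustration. It only checks UTF-8
well-formedness (no overlong forms, no surrogates, nothing above U+10FFFF)
and makes no attempt to guess which legacy encoding a non-UTF-8 name is in.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical names, for illustration only.
enum class Encoding { Ascii, Utf8, Other };

// Returns Ascii if every byte is < 0x80, Utf8 if the bytes are well-formed
// UTF-8 containing at least one multi-byte sequence, and Other otherwise
// (e.g. Latin-1, Shift-JIS, or EBCDIC text with high bytes).
inline Encoding classify(std::string_view bytes) {
    bool saw_multibyte = false;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {  // plain ASCII byte
            ++i;
            continue;
        }

        // Decode the lead byte: sequence length, initial payload bits, and
        // the smallest code point that legitimately needs that length.
        std::size_t len = 0;
        unsigned long cp = 0;
        unsigned long min_cp = 0;
        if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = b0 & 0x1F; min_cp = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = b0 & 0x0F; min_cp = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = b0 & 0x07; min_cp = 0x10000;
        } else {
            return Encoding::Other;  // stray continuation or invalid lead byte
        }

        if (bytes.size() - i < len) return Encoding::Other;  // truncated

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return Encoding::Other;  // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong encodings, UTF-16 surrogates, and out-of-range values.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return Encoding::Other;
        }

        saw_multibyte = true;
        i += len;
    }
    return saw_multibyte ? Encoding::Utf8 : Encoding::Ascii;
}
```

With this, `classify("foo.cpp")` is `Ascii`, the UTF-8 bytes for "café.cpp"
come back as `Utf8`, and the Latin-1 bytes for the same name come back as
`Other`.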

Received on 2019-03-08 16:31:55