Re: [Tooling] [isocpp-modules] Filename requirements for the SG15 TR

From: Mathias Stearn <redbeard0531+isocpp_at_[hidden]>
Date: Fri, 8 Mar 2019 10:31:40 -0500
On Thu, Mar 7, 2019 at 7:19 PM Tom Honermann <tom_at_[hidden]> wrote:

> I think the committee currently has a UTF-8 bias that doesn't necessarily
> reflect the global C++ community. We don't have much representation from
> Japan or China where, as I understand it, Shift-JIS and GB18030 still have
> significant usage. We also have few, if any, z/OS users in the committee
> outside of IBM representatives.

Not to be dismissive, but z/OS developers are a tiny subset of C++
developers. Even when targeting z series hardware (as we have for a few
years now), there is the option of using Linux, which seems to be a fully
supported platform. If supporting z/OS makes the experience worse or more
complicated for other users, then I think the best option for the broader
ecosystem is to leave it out of scope for the TR. That platform can offer
an equivalent mechanism that better fits its eccentricities. I want to
point out that EBCDIC seems to be the only remaining encoding that isn't an
ASCII superset (Shift-JIS replaces 2 characters in ASCII, but they don't
matter for our purposes), so to support it we would be taking on
substantial additional complexity that is only needed for that one niche
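
To illustrate the ASCII-superset point, here is a minimal Python sketch. It
assumes Python's built-in "cp037" codec as a stand-in for the EBCDIC code
page in use (z/OS systems may use a different one), and the helper name
is_ascii_superset is hypothetical. It checks whether each of the 128 ASCII
byte values decodes to the same code point under a given codec.

```python
def is_ascii_superset(codec: str) -> bool:
    """Return True if bytes 0x00-0x7F decode to the same code points as ASCII.

    This only tests single bytes in isolation; it is a rough check, not a
    full characterization of the encoding.
    """
    for byte in range(128):
        try:
            if bytes([byte]).decode(codec) != chr(byte):
                return False
        except UnicodeDecodeError:
            return False
    return True


if __name__ == "__main__":
    # "cp037" is one EBCDIC code page shipped with Python; it is an
    # assumption that it is representative of the z/OS case.
    for name in ("utf-8", "latin-1", "cp037"):
        print(name, is_ascii_superset(name))
```

Under this check, a tool that only ever compares ASCII bytes such as '/'
and '.' in file names can ignore the difference between UTF-8 and Latin-1,
but not between either of them and EBCDIC.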

> UTF-8 dominates the web, no one questions that. But within the C++
> ecosystem, I don't think UTF-8 dominates to a similar degree, at least not
> outside of the US and Europe. I wish I had data to back that up.

From http://www.tomazos.com/actcd16.pdf: "We executed standard C++
translation phase 1 through 3 on the source files assuming a UTF-8 encoding.
We found that 99.0% of the source files tokenized successfully. Of the
remaining 1.0% the majority of the errors were decoding problems (most
likely from ISO-8859 / Latin1 encoding)"

This was a scan of all C and C++ packages in Ubuntu. While that obviously
only represents the open source, Unix-targeting subset of the C++
community, this seems to imply that for that sub-community UTF-8 (and the
ASCII subset) dominates the source content. On top of that, I would expect
file names to have even fewer non-ASCII characters than file content, since
it is common to limit non-ASCII characters to comments and strings.
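
For anyone who wants to run a similar measurement on their own code base,
here is a rough Python sketch. It is not the methodology of the cited
survey, which ran translation phases 1 through 3; this only checks whether
bytes decode, and the function name classify_encoding is hypothetical.

```python
def classify_encoding(data: bytes) -> str:
    """Classify a byte string as 'ascii', 'utf-8', or 'other'.

    'utf-8' here means "valid UTF-8 that contains at least one non-ASCII
    character"; 'other' means the bytes are not valid UTF-8 at all.
    """
    if data.isascii():
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


if __name__ == "__main__":
    samples = {
        "plain": b"int main() {}",
        "utf8 comment": "// caf\u00e9".encode("utf-8"),
        "latin1 comment": "// caf\u00e9".encode("latin-1"),
    }
    for label, data in samples.items():
        print(label, classify_encoding(data))
```

The same function could be applied to file names by passing it the result
of os.fsencode(name), which would give a count of how many names fall
outside the ASCII subset.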

