On Thu, Mar 7, 2019 at 7:19 PM Tom Honermann <tom@honermann.net> wrote:

I think the committee currently has a UTF-8 bias that doesn't necessarily reflect the global C++ community.  We don't have much representation from Japan or China where, as I understand it, Shift-JIS and GB18030 still have significant usage.  We also have few, if any, z/OS users in the committee outside of IBM representatives.

Not to be dismissive, but z/OS developers are a tiny subset of C++ developers. Even when targeting z series hardware (which we have done for a few years now), there is the option of using Linux, which seems to be a fully supported platform. If supporting z/OS makes the experience worse or more complicated for other users, then I think the best option for the broader ecosystem is to leave it out of scope for the TR. That platform can offer an equivalent mechanism that better fits its eccentricities. I want to point out that EBCDIC seems to be the only remaining encoding that isn't an ASCII superset (Shift-JIS replaces 2 characters in ASCII, but they don't matter for our purposes), so to support it we would be taking on substantial additional complexity that is only needed for that one niche platform.
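
To make the ASCII-superset point concrete, here is a minimal Python sketch. It assumes code page 037 (Python's "cp037" codec) as a representative EBCDIC variant, which is my choice for illustration only, and is_ascii_superset is a hypothetical helper name, not part of any proposal.

```python
def is_ascii_superset(codec: str) -> bool:
    """Return True if every ASCII character encodes to its own byte value."""
    for code_point in range(128):
        try:
            encoded = chr(code_point).encode(codec)
        except UnicodeEncodeError:
            # The codec cannot represent this ASCII character at all.
            return False
        if encoded != bytes([code_point]):
            # The character exists but lives at a different byte value.
            return False
    return True


if __name__ == "__main__":
    # "cp037" is assumed here as one example of an EBCDIC code page.
    for name in ("utf-8", "latin-1", "cp037"):
        print(name, is_ascii_superset(name))
```

With an ASCII-superset encoding, code that only looks for bytes such as '/' or '.' keeps working unchanged; with EBCDIC those characters sit at different byte values, so every such comparison needs to know the encoding.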

  UTF-8 dominates the web, no one questions that.  But within the C++ ecosystem, I don't think UTF-8 dominates to a similar degree, at least not outside of the US and Europe.  I wish I had data to back that up.

From http://www.tomazos.com/actcd16.pdf: "We executed standard C++ translation phase 1 through 3 on the source files assuming a UTF-8 encoding. We found that 99.0% of the source files tokenized successfully. Of the remaining 1.0% the majority of the errors were decoding problems (most likely from ISO-8859 / Latin1 encoding)"

This was a scan of all C and C++ packages in Ubuntu. While that obviously only represents the open source, Unix-targeting subset of the C++ community, this seems to imply that for that sub-community UTF-8 (and the ASCII subset) dominates the source content. On top of that, I would expect file names to have even fewer non-ASCII characters than file content, since it is common to limit non-ASCII characters to comments and strings.
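
For anyone who wants to run a rough version of this check on their own tree, here is a minimal Python sketch. It only tests whether the bytes decode, not whether they tokenize through translation phases 1 to 3 as the paper did, so it is an approximation of that scan, not a reproduction. The names classify and tally are hypothetical.

```python
from collections import Counter
from typing import Iterable


def classify(data: bytes) -> str:
    """Bucket a byte string as 'ascii', 'utf-8', or 'other'."""
    if data.isascii():
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8; could be Latin-1, Shift-JIS, GB18030, etc.
        return "other"
    return "utf-8"


def tally(blobs: Iterable[bytes]) -> Counter:
    """Count how many byte strings fall into each bucket."""
    return Counter(classify(blob) for blob in blobs)


if __name__ == "__main__":
    samples = [
        b"int main() {}",
        "// caf\u00e9".encode("utf-8"),
        "// caf\u00e9".encode("latin-1"),
    ]
    print(tally(samples))
```

The same classify function can be applied to file names (for example via os.fsencode on each path) to test the expectation that names are even more ASCII-heavy than contents.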