Re: [Tooling] [isocpp-modules] Filename requirements for the SG15 TR

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 8 Mar 2019 17:28:42 -0500
On 3/8/19 10:31 AM, Mathias Stearn wrote:
>
>
> On Thu, Mar 7, 2019 at 7:19 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> I think the committee currently has a UTF-8 bias that doesn't
> necessarily reflect the global C++ community. We don't have much
> representation from Japan or China where, as I understand it,
> Shift-JIS and GB18030 still have significant usage. We also have
> few, if any, z/OS users in the committee outside of IBM
> representatives.
>
> Not to be dismissive, but z/OS developers are a tiny subset of C++
> developers.
This is true, but they also serve an important market and already face
challenges from working in a more niche space. If we can reasonably
make things easier for them, I think we should.
> Even when targeting z series hardware (we have for a few years now),
> there is the option of using linux which seems to be a fully supported
> platform.

Linux on z is great, but not helpful for those who have actual z/OS
requirements.

> If supporting z/OS makes the experience worse or more complicated for
> other users, then I think the best option for the broader ecosystem is
> to leave it out of scope for the TR. That platform can offer an
> equivalent mechanism that better fits its eccentricities. I want to
> point out that EBCDIC seems to be the only remaining encoding that
> isn't an ASCII-superset (shift-jis replaces 2 characters in ASCII, but
> they don't matter for our purposes), so to support it we would be
> taking on substantial additional complexity that is only needed for
> that one niche platform.
I don't think anything we've discussed so far adds substantial
complexity. In fact, what we've discussed is also relevant to ASCII
platforms.
>
> UTF-8 dominates the web, no one questions that. But within the
> C++ ecosystem, I don't think UTF-8 dominates to a similar degree,
> at least not outside of the US and Europe. I wish I had data to
> back that up.
>
> From http://www.tomazos.com/actcd16.pdf: "We executed standard C++
> translation phase 1 through 3 on the source files assuming a
> UTF-8 encoding. We found that 99.0% of the source files tokenized
> successfully. Of the remaining 1.0% the majority of the errors were
> decoding problems (most likely from ISO-8859 / Latin1 encoding)"
>
> This was a scan of all C and C++ packages in Ubuntu. While that
> obviously only represents the open source, unix-targeting subset of
> the C++ community, this seems to imply that for that sub-community
> utf-8 (and the ascii subset) dominates the source content. On top of
> that, I would expect file names to have even fewer non-ascii characters
> than file content, since it is common to limit non-ascii characters to
> comments and strings.

For that subset, I agree; those results match my expectations. It's
worth noting that the survey doesn't answer the question of what might
break if characters outside the ASCII range were introduced into that
99% of source files. For example, those files aren't necessarily
consumed as UTF-8.
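
To make the distinction above concrete, here is a minimal sketch of the
kind of check such a survey implies: sorting a byte sequence into "pure
ASCII", "well-formed UTF-8 with non-ASCII content", or "not well-formed
UTF-8". The names ByteClass and classify are hypothetical and invented
for this illustration; this is not the tool used in the cited survey,
and it only inspects bytes, not how a given compiler or build system
would actually decode them.

```cpp
#include <cstddef>
#include <string>

// Hypothetical classification, introduced only for this illustration.
enum class ByteClass { Ascii, Utf8, NotUtf8 };

// Classify a byte sequence as pure ASCII, well-formed UTF-8 containing at
// least one non-ASCII character, or not well-formed UTF-8. The lead-byte
// ranges reject overlong forms, surrogates, and values above U+10FFFF.
ByteClass classify(const std::string& bytes) {
    bool saw_non_ascii = false;
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {  // single-byte (ASCII) character
            ++i;
            continue;
        }
        saw_non_ascii = true;

        std::size_t len = 0;
        unsigned char lo = 0x80;  // allowed range for the second byte
        unsigned char hi = 0xBF;
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; }
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; }
        else if (b0 >= 0xEE && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; }
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; }
        else return ByteClass::NotUtf8;    // stray continuation or invalid lead

        if (i + len > n) return ByteClass::NotUtf8;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            const unsigned char min = (k == 1) ? lo : 0x80;
            const unsigned char max = (k == 1) ? hi : 0xBF;
            if (c < min || c > max) return ByteClass::NotUtf8;
        }
        i += len;
    }
    return saw_non_ascii ? ByteClass::Utf8 : ByteClass::Ascii;
}
```

For example, the name "main.cpp" classifies as Ascii under this sketch,
and its bytes are identical in UTF-8 and Latin-1. The name "café.cpp" is
different: its UTF-8 bytes (é as 0xC3 0xA9) classify as Utf8, while its
Latin-1 bytes (é as 0xE9) classify as NotUtf8. A file that passes today
only because it is pure ASCII says nothing about which of those two
encodings its consumers assume.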

Tom.


Received on 2019-03-08 23:28:46