[SG16-Unicode] Text for updating the unicode reference as submitted

From: Steve Downey <sdowney_at_[hidden]>
Date: Tue, 08 May 2018 01:18:14 +0000

Update The Reference To The Unicode Standard

   - Document number: P1025R0
   - Date: 2018-04-23
   - Author: Steve Downey <sdowney2_at_[hidden]>
   - Audience: Core, LWG, SG16

Abstract

The reference to the Unicode Standard in the C++ Standard should be updated
to the stable base standard or any successor standard.

References

P0417R1
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0417r1.html> :
C++17 should refer to ISO/IEC 10646 2014 instead of 1994 (R1)

Preferred New Reference

The Unicode Consortium, the entity responsible for the Unicode standard,
documents the preferred citations
<http://www.unicode.org/versions/index.html#Citations> for the Unicode
Standard. The current standard is version 10.0. The existing reference
should be changed to:

The Unicode Standard, Version 10.0 or later

The Unicode Consortium. The Unicode Standard, Version 10.0.0, (Mountain
View, CA: The Unicode Consortium, 2017. ISBN 978-1-936213-16-0)
http://www.unicode.org/versions/Unicode10.0.0/

The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/

The reason for not referring to the equivalent ISO Standard, 10646, is that
the ISO standard is incomplete with respect to the Unicode Standard. From the
Unicode and ISO 10646 FAQ <http://unicode.org/faq/unicode_iso.html>:

Although the character codes and encoding forms are synchronized between
Unicode and ISO/IEC 10646, the Unicode Standard imposes additional
constraints on implementations to ensure that they treat characters
uniformly across platforms and applications. To this end, it supplies an
extensive set of functional character specifications, character data,
algorithms and substantial background material that is not in ISO/IEC 10646.

For existing purposes, the C++ Standard is only concerned with character
codes and encoding forms. However, to standardise any Unicode text
processing, the algorithms and character data will need to be referenced.
Therefore, we might as well update the reference now.

Referring to 10.0 or later sets a baseline, but allows implementors to move
to later standards, including new emoji, at their discretion.

The equivalent to the 10.0 standard is ISO/IEC 10646:2017 with some
additions from the first amendment to 10646. If there are strong reasons
not to refer to the Unicode Standard itself, the reference for character
sets and encoding should be changed to:

ISO/IEC 10646:2017 Information technology – Universal Coded Character Set
(UCS) plus 10646:2017/DAmd 1, or successor

The 'or successor' wording is borrowed from the current ECMAScript
standard, ECMAScript® 2017 Language Specification (ECMA-262, 8th edition,
June 2017)
<https://www.ecma-international.org/ecma-262/8.0/index.html#sec-normative-references>.
The 'or successor' language has been in place since at least the 2015
standard.

The Unicode Consortium has made a number of stability guarantees based on
the referenced standard, promising that any currently conforming Unicode
text will continue to be interpreted the same way in the future for
purposes of encoding, collation, registration, and locales. They are
documented as part of their policies
<https://www.unicode.org/policies/policies.html>.

This means that it is safe to allow implementations to adopt newer Unicode
standards without affecting the interpretation of existing conforming text.
In practice, due to customer demand, everyone ships the latest Unicode data
and algorithms available, so allowing later versions matches existing
practice. This matters particularly as new, advanced Unicode libraries are
incorporated into the standard.

Immediate Effects

The Unicode standard that the C++ Standard refers to predates UTF-16 and
UTF-32, instead defining UCS2 and UCS4. Moving to a newer standard would
make the terms UTF-16 and UTF-32 well defined in the C++ Standard. It has been argued
that the ECMAScript standard referred to uses a newer Unicode standard, in
which those terms are defined, so those terms are defined for the C++
Standard by transitive reference. If that argument is accepted, then moving
to the newer version makes the intent explicit.

In addition, in 1996, as part of amendments 5, 6 and 7, the original set of
Hangul characters was removed and re-added at a new location, and Tibetan
characters were added again. This places the current citation in the
standard of "ISO/IEC 10646-1:1993" in conflict with the version imported by
way of the ECMAScript standard. In practice, all implementors adopt the
later version for conversion operations.

The Wikipedia article on Unicode
<https://en.wikipedia.org/wiki/Unicode#Versions> has a summary of the
changes over the years.

UCS2 and UCS4 in codecvt facets

The last proposal to update the Unicode Standard reference, P0417R1, was
entangled with deprecation of UCS2 and UCS4. The remaining references are
in the now deprecated codecvt facets [depr.locale.stdcvt.req]. There is
resistance to changing those to UTF-16 and UTF-32, since, particularly for
UCS2, there are real changes in behavior. UTF-32 can be viewed as UCS4.
UTF-16 cannot be similarly viewed as UCS2. Since there may be users of the
facility depending on the behavior as it was when standardized, this paper
does not propose changing them, but instead proposes leaving them in place
as deprecated features with no formal definition, as there is none to refer
to anymore. This should not be interpreted as placing any onus on
implementors to change the existing, deprecated facets.
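
The behavioral difference between UCS2 and UTF-16 can be illustrated with a
minimal sketch. The helper names encode_utf16 and representable_in_ucs2 are
hypothetical and are not part of the codecvt facets or of any proposal here.
The sketch assumes the usual description of UCS2 as a fixed-width 16-bit
form with no surrogate pairs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (illustration only): encode one Unicode scalar value
// as UTF-16 code units.
std::vector<char16_t> encode_utf16(char32_t cp) {
    // Precondition: cp is a scalar value, i.e. at most U+10FFFF and not a
    // surrogate code point.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit, the same value a
        // UCS2 encoder would produce.
        return {static_cast<char16_t>(cp)};
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    const char32_t v = cp - 0x10000;
    return {static_cast<char16_t>(0xD800 + (v >> 10)),     // high surrogate
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))};  // low surrogate
}

// Hypothetical helper (illustration only): under the assumption above, UCS2
// can only represent code points that fit in one 16-bit unit.
bool representable_in_ucs2(char32_t cp) {
    return cp <= 0xFFFF;
}
```

For example, encode_utf16(0x1F600) yields the two code units 0xD83D 0xDE00,
while representable_in_ucs2(0x1F600) is false. A facet specified in terms of
UCS2 therefore cannot simply be respecified as UTF-16 without changing what
it does for such input.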

__STDC_ISO_10646__ macro

The macro __STDC_ISO_10646__ in [cpp.predefined] can be left unchanged. The
ISO/IEC 10646 version will be the version that corresponds to the Unicode
Standard in effect.
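
As a rough sketch of how a program might inspect this macro: when it is
defined, [cpp.predefined] gives it the form yyyymmL. The helper name
iso10646_snapshot is hypothetical, and the macro is only conditionally
defined, so the sketch has to handle its absence.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (illustration only): report the ISO/IEC 10646 snapshot
// the implementation claims for wchar_t, as "yyyy-mm", or "unspecified" when
// the macro is not defined.
std::string iso10646_snapshot() {
#ifdef __STDC_ISO_10646__
    const long v = __STDC_ISO_10646__;  // integer literal of the form yyyymmL
    const long year = v / 100;
    const long month = v % 100;
    // Sanity check on the yyyymm form: the month part should be 1..12.
    assert(month >= 1 && month <= 12);
    return std::to_string(year) + "-" + (month < 10 ? "0" : "") +
           std::to_string(month);
#else
    return "unspecified";
#endif
}
```

The value returned depends entirely on the implementation. The point of the
sketch is only that code testing the macro keeps working unchanged if the
normative reference moves to the Unicode Standard.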

Fall-back Reference

The current Unicode standard, 10.0, is equivalent to

10646:2017, fifth edition, plus the following additions from Amendment 1 to
the fifth edition:

56 emoji characters

285 hentaigana

3 additional Zanabazar Square characters

according to the Unicode 10.0 Standard
<https://www.unicode.org/versions/Unicode10.0.0/>

The corresponding ISO edition is ISO/IEC 10646:2017, so as a fall-back
position, the reference in the C++ Standard should be updated to

ISO/IEC 10646:2017 Information technology – Universal Coded Character Set
(UCS) plus 10646:2017/DAmd 1

without reference to the latest standard.

Proposed Changes

*Strike the wording highlighted in red and add the wording highlighted in
green.*

1.2 Normative references [intro.refs]

— ISO/IEC 10646-1:1993, Information technology — Universal Multiple-Octet
Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual
Plane

— The Unicode Consortium. The Unicode Standard, Version 10.0.0, (Mountain
View, CA: The Unicode Consortium, 2017. ISBN 978-1-936213-16-0)
http://www.unicode.org/versions/Unicode10.0.0/

— The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/

— ISO/IEC 10646, Information technology — Universal Multiple-Octet Coded
Character Set (UCS)

Add:

5 The ISO/IEC 10646 version is the corresponding version to the Unicode
Standard, as documented by the Unicode Standard. For version 10.0 this is
ISO/IEC 10646:2017 plus 10646:2017/DAmd 1.

Links

   1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0417r1.html
   2. http://www.unicode.org/versions/index.html#Citations
   3. http://unicode.org/faq/unicode_iso.html
   4. https://www.ecma-international.org/ecma-262/8.0/index.html#sec-normative-references
   5. https://www.unicode.org/policies/policies.html
   6. https://en.wikipedia.org/wiki/Unicode#Versions
   7. https://www.unicode.org/versions/Unicode10.0.0/

Received on 2018-05-08 03:18:27