Subject: Re: [SG16-Unicode] code_unit_sequence and code_point_sequence
From: Lyberta (lyberta_at_[hidden])
Date: 2018-06-20 00:52:00
> Wanting to work with code points and code units is certainly fine and
> often necessary. But what need is served by these new containers that
> isn't already served by existing containers (std::vector, std::string,
> QString, etc.)?
There is no container adaptor that takes a sequence of code units and
presents it as a sequence of code points.
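To make the idea concrete, here is a minimal sketch of what such an adaptation does, written as a free function rather than a lazy adaptor for brevity. The name code_points is hypothetical, and error handling for ill-formed UTF-8 is omitted; the point is only that the input is code units and the output is code points.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: decode a UTF-8 code unit sequence into code
// points. A real adaptor would be a lazy range and validate input.
std::vector<char32_t> code_points(const std::string& units) {
    std::vector<char32_t> cps;
    for (std::size_t i = 0; i < units.size();) {
        unsigned char lead = static_cast<unsigned char>(units[i]);
        // Determine the sequence length from the lead byte.
        int len = lead < 0x80 ? 1
                : (lead >> 5) == 0x6 ? 2
                : (lead >> 4) == 0xE ? 3 : 4;
        // Take the payload bits of the lead byte...
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // ...then accumulate 6 payload bits per continuation byte.
        for (int j = 1; j < len; ++j)
            cp = (cp << 6) | (static_cast<unsigned char>(units[i + j]) & 0x3F);
        cps.push_back(cp);
        i += len;
    }
    return cps;
}
```

For example, the two code units 0xC3 0xA9 decode to the single code point U+00E9.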
> I agree with this goal and we've been discussing ways to accomplish it.
> I've been intending to write a proposal to make it possible to extract
> the buffer from std::vector and std::string (and eventually std::text)
> and move it between such containers. This would work similarly to node
> extraction for node based containers and would introduce a type similar
> to node handle. The intent is to enable something like:
> std::vector<char> v = ...;
> std::string s(std::move(v.extract()));
> std::text t(std::move(s.extract()));
> QString and other custom string types would be able to opt-in to the
This demands something that doesn't exist yet for third-party types and
would limit compatibility with them even further.
> I don't see a convincing argument here and I'm still struggling to
> understand your perspective. In my opinion, having the endian concerns
> and byte level access wrapped in an encoding class that has a well known
> name is quite convenient. Is your concern primarily philosophical?
> Perhaps based on separation of concerns?
Yes. The code point layer, and the layers above it, shouldn't care about
endianness or the BOM; only the byte layer should. I have written a
serialization library (already ported to std::endian, by the way), and I
will implement endianness and BOM handling there, so I will have code
samples ready.
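As a sketch of that separation, here is what a byte-layer serializer might look like. The names byte_order and serialize_utf16 are my own illustrations, not proposed API; the point is that the endianness choice and the BOM decision both live entirely in the serialization step, and nothing above it ever sees them.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration: endianness and BOM are concerns of the
// byte (serialization) layer only.
enum class byte_order { big, little };

std::vector<std::uint8_t> serialize_utf16(const std::u16string& text,
                                          byte_order order,
                                          bool write_bom) {
    std::vector<std::uint8_t> bytes;
    auto put = [&](char16_t u) {
        if (order == byte_order::big) {
            bytes.push_back(static_cast<std::uint8_t>(u >> 8));
            bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
        } else {
            bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
            bytes.push_back(static_cast<std::uint8_t>(u >> 8));
        }
    };
    if (write_bom)
        put(char16_t(0xFEFF));  // U+FEFF byte order mark
    for (char16_t u : text)
        put(u);
    return bytes;
}
```

The code point layer hands over a std::u16string and never learns which byte order was chosen or whether a BOM was emitted.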
> I can see uses for a codec that deals only with the endian concerns.
> But why force programmers concerned with working with text to explicitly
> interact with such a type?
The idea is that programmers won't need to. Given:
std::text t = u8"Hello";
the standard library has chosen native endianness and no reading or
writing of a BOM - a sane default - and the type of t reflects that
choice. Then we provide helpers such as:
auto t = std::make_text<std::endian::big, std::bom>(u8"Hello");
Here the type of t records that the programmer has explicitly requested
big-endian code units with reading and writing of a BOM. std::bom and
std::no_bom are just placeholders; this should be an enum class.
> Is it your intention that, given u"text" (UTF-16), it should be
> possible to obtain a std::byte pointer to the underlying code units
> (e.g., a sequence of bytes in either BE or LE order)? If so, that isn't
> possible because some systems have 16-bit (or larger) bytes where
> sizeof(char16_t)==1; the individual octets of char16_t are not
> addressable on such systems.
char16_t and char32_t are POD types (I know this concept is deprecated),
so doing reinterpret_cast<std::byte*>(&char16_t_variable) is legal, and
it is necessary for serialization. But I don't know how text handling
(or especially file I/O) works on such systems. It is possible to work
with octets portably by holding them in std::uint_least8_t, and to
convert a char16_t to a sequence of octets with masking and bit shifts.
On a system where CHAR_BIT > 8, char8_t will have unused bits; where
CHAR_BIT > 16, char16_t will have unused bits; and where CHAR_BIT > 32,
char32_t will have unused bits. That's not a problem.
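The masking-and-shifting approach can be sketched like this. to_octets_be is a hypothetical name; the technique works regardless of CHAR_BIT because only the low 16 bits of each char16_t are ever touched, and each extracted octet fits in uint_least8_t by definition.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: extract the two octets of each char16_t by masking and bit
// shifting, storing them in uint_least8_t so the code stays portable
// even when CHAR_BIT > 8 (unused high bits are simply left zero).
std::vector<std::uint_least8_t> to_octets_be(const char16_t* s,
                                             std::size_t n) {
    std::vector<std::uint_least8_t> octets;
    octets.reserve(n * 2);
    for (std::size_t i = 0; i != n; ++i) {
        octets.push_back((s[i] >> 8) & 0xFF);  // high octet first
        octets.push_back(s[i] & 0xFF);         // then low octet
    }
    return octets;
}
```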
> Technically, Windows doesn't violate the standard by having a 16-bit
> wchar_t. It violates the standard by using a wide execution character
> set that defines code points that do not fit in its (16-bit) wchar_t
> type. We have an issue (https://github.com/sg16-unicode/sg16/issues/9)
> to track modifying the standard to enable Microsoft's implementation to
> be conforming.
My personal opinion is that wchar_t has failed to achieve its goal and
should be removed from the standard. I guess we will have to wait 20+
years for that.
SG16 list run by email@example.com