SG16

Subject: Re: [SG16-Unicode] std::byte based I/O library
From: Lyberta (lyberta_at_[hidden])
Date: 2019-02-07 04:20:00


Corentin:
> Ideally, we need 3 separate things:
>
> 1/ A way to read/write byte streams
> 2/ A way to transcode to/from non-unicode encoding
> 3/ A way to determine the encoding expected by a given stream.
>
> The latter is in the general case not possible, and it might not be
> generically possible for simple things like console i/o.
> I think the only sane way forward is to by default assume utf8 everywhere
> and work with os vendors to ensure they have the same defaults.
>
> I think 1/ falls entirely outside of the scope of SG16

What about transforming text to bytes?

I maintain a C++20 serialization library and have plans to offer it for
standardization: https://gitlab.com/ftz/serialization

The key problem is that I'm not sure exactly how it would work with text.

Since the execution character set and UTF-8 use bytes as code units,
[de]serializing them is effectively a memcpy. UTF-16 and UTF-32, on the
other hand, require handling of endianness. My streams store a
user-supplied endianness, so the conversion happens automatically during IO.
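
To make the endianness point concrete, here is a minimal sketch of writing
and reading UTF-16 code units in a chosen byte order. It is not the API of
the library mentioned above: ByteOrder, write_utf16 and read_utf16 are
hypothetical names, and ByteOrder stands in for std::endian so the sketch
also compiles as C++17. It uses shifts rather than a conditional byteswap,
so the result does not depend on the host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for std::endian (C++20, <bit>).
enum class ByteOrder { little, big };

// Hypothetical helper: appends each UTF-16 code unit to `out` in the
// requested byte order, independent of the host byte order.
inline void write_utf16(std::vector<std::byte>& out,
                        std::u16string_view text,
                        ByteOrder order)
{
    for (char16_t unit : text) {
        const auto hi = static_cast<std::byte>((unit >> 8) & 0xFF);
        const auto lo = static_cast<std::byte>(unit & 0xFF);
        if (order == ByteOrder::big) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            out.push_back(lo);
            out.push_back(hi);
        }
    }
}

// Hypothetical helper: the inverse of write_utf16.
inline std::u16string read_utf16(const std::vector<std::byte>& bytes,
                                 ByteOrder order)
{
    assert(bytes.size() % 2 == 0 && "UTF-16 data must hold whole code units");
    std::u16string text;
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const auto first = std::to_integer<unsigned>(bytes[i]);
        const auto second = std::to_integer<unsigned>(bytes[i + 1]);
        const unsigned unit = (order == ByteOrder::big)
                                  ? ((first << 8) | second)
                                  : ((second << 8) | first);
        text.push_back(static_cast<char16_t>(unit));
    }
    return text;
}
```

With ByteOrder::big, the string u"A" comes out as the bytes 00 41; with
ByteOrder::little, as 41 00. UTF-8 needs none of this because each code
unit is a single byte.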

Consider this syntax used by my library:

BinaryOutputStream& stream = ...; // some existing output stream
std::u16string string{ ... };
stream.SetEndianness(std::endian::big); // code units are written big-endian
Serialization::Write(stream, string);

In my opinion, the default behavior would be to byteswap each code
unit on a little-endian system before writing. What about the BOM? That
would require something explicit. I think the generic way would be to have
a strong type that has special BOM handling during IO.
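
One possible shape for such a strong type, as a rough sketch only:
Utf16WithBom, ByteOrder, write_unit and write are hypothetical names, not
part of any existing library, and ByteOrder stands in for std::endian. A
plain string is written without a BOM; wrapping it in the strong type
selects an overload that emits U+FEFF first.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in for std::endian (C++20, <bit>).
enum class ByteOrder { little, big };

// Hypothetical strong type: wrapping a string opts in to BOM output.
struct Utf16WithBom {
    std::u16string_view text;
};

// Appends one UTF-16 code unit in the requested byte order.
inline void write_unit(std::vector<std::byte>& out, char16_t unit,
                       ByteOrder order)
{
    const auto hi = static_cast<std::byte>((unit >> 8) & 0xFF);
    const auto lo = static_cast<std::byte>(unit & 0xFF);
    if (order == ByteOrder::big) {
        out.push_back(hi);
        out.push_back(lo);
    } else {
        out.push_back(lo);
        out.push_back(hi);
    }
}

// Plain strings: code units only, no BOM.
inline void write(std::vector<std::byte>& out, std::u16string_view text,
                  ByteOrder order)
{
    for (char16_t unit : text) {
        write_unit(out, unit, order);
    }
}

// Wrapped strings: the BOM (U+FEFF) goes first, then the code units.
inline void write(std::vector<std::byte>& out, Utf16WithBom wrapped,
                  ByteOrder order)
{
    write_unit(out, u'\uFEFF', order);
    write(out, wrapped.text, order);
}
```

Calling write(out, Utf16WithBom{u"A"}, ByteOrder::big) produces the bytes
FE FF 00 41, while write(out, u"A", ByteOrder::big) produces only 00 41,
so the BOM never appears unless the caller asks for it.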

On point 2/: I think this is easily done using ranges-like
customization points. My serialization library uses them for serializing
user-defined types. For text conversion, a user would simply need to
customize something like std::to_unicode_code_point and
std::from_unicode_code_point; the rest of the code would simply use those.
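
A rough sketch of how that could look, under several assumptions: the
two function names are borrowed from the paragraph above but live in a
user namespace and are found by argument-dependent lookup, the encoding
tags Latin1 and Ascii are hypothetical, only single-byte encodings are
handled, and the signatures are invented for illustration. The generic
transcode function knows nothing about either encoding; it only calls the
two customization points.

```cpp
#include <optional>
#include <string>
#include <string_view>

namespace user {

// Hypothetical encoding tags.
struct Latin1 {};
struct Ascii {};

// Latin-1 code units map one-to-one onto U+0000..U+00FF.
inline char32_t to_unicode_code_point(Latin1, unsigned char unit)
{
    return static_cast<char32_t>(unit);
}

inline std::optional<unsigned char> from_unicode_code_point(Latin1,
                                                            char32_t cp)
{
    if (cp > 0xFF) return std::nullopt;  // not representable in Latin-1
    return static_cast<unsigned char>(cp);
}

// ASCII covers U+0000..U+007F; other code units decode to U+FFFD.
inline char32_t to_unicode_code_point(Ascii, unsigned char unit)
{
    return unit <= 0x7F ? static_cast<char32_t>(unit) : U'\uFFFD';
}

inline std::optional<unsigned char> from_unicode_code_point(Ascii,
                                                            char32_t cp)
{
    if (cp > 0x7F) return std::nullopt;  // not representable in ASCII
    return static_cast<unsigned char>(cp);
}

}  // namespace user

// Generic code: converts between any two single-byte encodings that
// provide the two customization points. Unrepresentable code points are
// replaced with `replacement`.
template <class From, class To>
std::string transcode(From from, To to, std::string_view input,
                      char replacement = '?')
{
    std::string output;
    for (char c : input) {
        // Unqualified calls, so the overloads are found via ADL.
        const char32_t cp =
            to_unicode_code_point(from, static_cast<unsigned char>(c));
        const auto unit = from_unicode_code_point(to, cp);
        output.push_back(unit ? static_cast<char>(*unit) : replacement);
    }
    return output;
}
```

For example, transcode(user::Latin1{}, user::Ascii{}, "caf\xE9") yields
"caf?", because U+00E9 has no ASCII representation. A multi-byte encoding
would need customization points that consume and produce more than one
code unit at a time, which this sketch deliberately leaves out.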

On point 3/: byte streams don't have a text encoding. It's up to the
user which encoding to write.

Is there a way for me to reach a wider audience without having a Google
account? The std-proposals page gives me a text-only list of topics without
any guidance on how to participate, probably because I have JavaScript
disabled.




SG16 list run by herb.sutter at gmail.com