sg16: Re: [SG16-Unicode] std::byte based I/O library

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 7 Feb 2019 11:33:11 +0100

2/ Yep, work is being done there: https://wg21.link/p1439r0
1/ Text _is_ bytes. Unicode specifies a BOM and little/big-endian versions,
so I think prepending a BOM (should you want to) and byte-swapping can be
done as part of transcoding.
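
To make 1/ a bit more concrete, here is a minimal sketch of writing UTF-16
code units out as bytes, with the byte order and the optional BOM handled in
the same pass. The names ByteOrder, append_code_unit and utf16_to_bytes are
made up for illustration; they are not from P1439 or from Lyberta's library,
and a real design would probably look different.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical names, for illustration only.
enum class ByteOrder { big, little };

// Append one UTF-16 code unit as two bytes in the requested order.
// Shifting and masking gives the same result on any host, so no
// host-endianness check or explicit byteswap is needed.
inline void append_code_unit(std::vector<std::byte>& out, char16_t unit,
                             ByteOrder order) {
    const auto high = static_cast<std::byte>((unit >> 8) & 0xFF);
    const auto low = static_cast<std::byte>(unit & 0xFF);
    if (order == ByteOrder::big) {
        out.push_back(high);
        out.push_back(low);
    } else {
        out.push_back(low);
        out.push_back(high);
    }
}

// Serialize UTF-16 text to bytes, optionally prefixed with U+FEFF (the BOM)
// written in the same byte order as the text.
inline std::vector<std::byte> utf16_to_bytes(std::u16string_view text,
                                             ByteOrder order, bool with_bom) {
    std::vector<std::byte> out;
    out.reserve((text.size() + (with_bom ? 1 : 0)) * 2);
    if (with_bom) {
        append_code_unit(out, static_cast<char16_t>(0xFEFF), order);
    }
    for (const char16_t unit : text) {
        append_code_unit(out, unit, order);
    }
    return out;
}
```

With this, utf16_to_bytes(u"A", ByteOrder::big, true) yields the four bytes
FE FF 00 41. Whether the BOM is requested through a flag like this or through
a dedicated strong type is a separate design question.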

3/ Streams don't have an explicit encoding (or locale, for that matter),
but they might expect one; think, for example, of writing to stdout. The
question, then, is how you figure out what the peer expects. In the general
case you might not be able to, and the only solution is for vendors to
default to expecting UTF-8 and for C++ to default to outputting text as UTF-8.
Getting everyone on the same page will not be an easy feat.

On Thu, 7 Feb 2019 at 11:20 Lyberta <lyberta_at_[hidden]> wrote:

> Corentin:
> > Ideally, we need 3 separate things:
> >
> > 1/ A way to read/write byte streams
> > 2/ A way to transcode to/from non-Unicode encodings
> > 3/ A way to determine the encoding expected by a given stream.
> >
> > The latter is not possible in the general case, and it might not be
> > generically possible even for simple things like console I/O.
> > I think the only sane way forward is to assume UTF-8 everywhere by default
> > and work with OS vendors to ensure they have the same defaults.
> >
> > I think 1/ falls entirely outside of the scope of SG16
>
> What about transforming text to bytes?
>
> I maintain a C++20 serialization library and have plans to offer it for
> standardization: https://gitlab.com/ftz/serialization
>
> The key problem is that I'm not sure exactly how it would work with text.
>
> Since the execution character set and UTF-8 use bytes as code units,
> [de]serializing them is effectively a memcpy. UTF-16 and UTF-32, on the
> other hand, require handling of endianness. My streams store a
> user-supplied endianness, so conversion happens automatically during IO.
>
> Consider this syntax used by my library:
>
> BinaryOutputStream& stream;
> std::u16string string{ ... };
> stream.SetEndianness(std::endian::big);
> Serialization::Write(stream, string);
>
> In my opinion the default behavior would be to byte-swap each code
> unit on a little-endian system before writing. What about the BOM? That
> would require something explicit. I think the generic way would be to have
> a strong type that has special BOM handling during IO.
>
> On point 2/: I think this is easily done using ranges-like
> customization points. My serialization library uses them for serializing
> user-defined types. For text conversion, a user would simply need to
> customize something like std::to_unicode_code_point and
> std::from_unicode_code_point; the rest of the code would simply use those.
>
> On point 3/: byte streams don't have a text encoding. It's up to the
> user which encoding to write.
>
> Is there a way for me to reach a wider audience without having a Google
> account? The std-proposals page gives me a text-only list of topics without
> any guidance on how to participate, probably because I have JavaScript
> disabled.

Received on 2019-02-07 11:33:24