Re: [SG16-Unicode] std::byte based I/O library

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 7 Feb 2019 13:23:03 +0100
I think mixing text and binary in the same layer is the fatal flaw of
iostream.
A binary stream that offers more than a bag of bytes would suffer the same
issue as iostream.

Separating text, locales and bytes into different layers is really
important.

If you have a binary_stream object and you can somehow set a BOM on it,
which is not only text-specific but, worse, Unicode-specific, something has
gone wrong.

When reading, we can implicitly handle the BOM and the UTF-X -> UTF-Y
conversion, which moves the state from the stream to the view:

view = stream | asUtfX
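
To make the reading side concrete, here is a minimal C++17 sketch. The name
as_utf8 is hypothetical (it stands in for asUtfX above and is not a proposed
API), and it only covers UTF-8 with no transcoding. It assumes the raw bytes
have already been read into a char buffer. The point is that BOM handling
lives in the function that produces the view, so the stream carries no BOM
state.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical analogue of `asUtfX` for UTF-8 only: returns a view over the
// raw input with a leading UTF-8 BOM (EF BB BF) skipped if one is present.
// The input stays a plain bag of bytes; no state is stored on any stream.
inline std::string_view as_utf8(std::string_view raw) {
    constexpr std::string_view bom = "\xEF\xBB\xBF";
    if (raw.substr(0, bom.size()) == bom) {
        raw.remove_prefix(bom.size());
    }
    return raw;
}

// Usage: the same call works whether or not the input starts with a BOM.
inline void as_utf8_example() {
    assert(as_utf8("\xEF\xBB\xBFhi") == "hi");
    assert(as_utf8("hi") == "hi");
}
```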

When writing, having the transcoding step handle the BOM lets you handle the
case where the text already has a BOM without introducing state on the stream:

text | toUtf8BOM >> stream
text | toUtf16BOM(stream.endianness) >> stream
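
A minimal C++17 sketch of the writing side, under the same caveats: byte_sink
and to_utf8_with_bom are hypothetical names (the latter stands in for
toUtf8BOM above), and the input is assumed to already be UTF-8. The BOM is
added by the transcoding step only when the text lacks one, and the sink
never learns that it is receiving text.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a binary stream: a bag of bytes with no
// knowledge of text, encodings, or BOMs.
struct byte_sink {
    std::vector<std::byte> bytes;
    void write(const std::byte* data, std::size_t size) {
        bytes.insert(bytes.end(), data, data + size);
    }
};

// Hypothetical analogue of `toUtf8BOM`: returns the UTF-8 text as bytes,
// prepending the UTF-8 BOM (EF BB BF) only if the text does not start with one.
inline std::vector<std::byte> to_utf8_with_bom(std::string_view utf8) {
    constexpr std::string_view bom = "\xEF\xBB\xBF";
    std::vector<std::byte> out;
    if (utf8.substr(0, bom.size()) != bom) {
        for (char c : bom) {
            out.push_back(static_cast<std::byte>(static_cast<unsigned char>(c)));
        }
    }
    for (char c : utf8) {
        out.push_back(static_cast<std::byte>(static_cast<unsigned char>(c)));
    }
    return out;
}

// Usage: the sink only ever sees bytes.
inline void to_utf8_with_bom_example() {
    byte_sink sink;
    const auto encoded = to_utf8_with_bom("hi");
    sink.write(encoded.data(), encoded.size());
    assert(sink.bytes.size() == 5);  // 3 BOM bytes + 'h' + 'i'
}
```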

> stdout is a special case
I don't believe it is, and I believe iostream should not have made this
assumption.

Maybe consoles are a special case, and maybe the best way to handle them
would be to have an encoding-aware console API.

But short of that, stdin/stdout deal with bags of bytes, and text
abstractions should be layered on top rather than baked in.

Another solution would be to have both text_stream and binary_stream, but
as you say, cout can be either depending on the use case.


Note that everything that applies to text probably applies to everything
else.
Serialization of a given arbitrary type on a stream should depend on the
specifics of the application rather than be a property of the stream.

stream.write(std::as_bytes(date));
vs
stream.write(date);

The latter is arguably nicer but gives applications no control over how
things are serialized.
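
As a rough illustration of serialization being the application's choice,
here is a C++17 sketch. The date struct, the to_wire function and the 4-byte
big-endian layout are all assumptions made up for this example, not anything
proposed in this thread. The application decides the layout and hands the
stream nothing but bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical application type.
struct date {
    std::uint16_t year;
    std::uint8_t month;
    std::uint8_t day;
};

// Application-chosen wire format (an assumption for this sketch):
// year as 16 bits big-endian, then month, then day. A different
// application could pick another layout without touching the stream.
inline std::vector<std::byte> to_wire(const date& d) {
    return {
        static_cast<std::byte>(d.year >> 8),
        static_cast<std::byte>(d.year & 0xFF),
        static_cast<std::byte>(d.month),
        static_cast<std::byte>(d.day),
    };
}

// Usage: 2019 is 0x07E3, so the first two bytes are 07 E3.
inline void to_wire_example() {
    const auto bytes = to_wire(date{2019, 2, 7});
    assert(bytes.size() == 4);
    assert(bytes[0] == std::byte{0x07});
    assert(bytes[1] == std::byte{0xE3});
}
```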



On Thu, 7 Feb 2019 at 12:42 Lyberta <lyberta_at_[hidden]> wrote:

> Corentin:
> > 1/ Text _is_ bytes. Unicode specifies BOM and little/big endian versions
> > so I think prepending a bom ( should you want to ) and byte-swapping can
> > be done along transcoding
>
> Maybe separate BOM handling into a format state of the stream. Consider
> this:
>
> enum class std::bom_handling
> {
> none, ///< Do not read or write BOM.
> not_utf8, ///< Read and write BOM only in UTF-16 and UTF-32.
> all ///< Read and write BOM in all 3 encoding forms.
> };
>
> BinaryOutputStream& stream;
> stream.GetFormat().SetBOMHandling(std::bom_handling::all);
>
> This feels like a more elegant design because it separates concerns better.
>
> >
> > 3/ streams don't have an explicit encoding ( or locale, for that matter ),
> > but they might expect one - think for example about writing to stdout.
>
> stdout is a special case and we have a way to handle it - execution
> character set. The only problem is that in order to go from Unicode to
> ECS right now you'd need to use std::codecvt which is a horrible mess.
>
> Still, there are some utilities that treat stdin/stdout as a stream of
> raw bytes so it doesn't always represent a text stream.
>
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>

Received on 2019-02-07 13:23:16