Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Thiago Macieira <thiago_at_[hidden]>
Date: Sat, 07 Sep 2019 09:44:42 -0700
On Saturday, 7 September 2019 08:26:00 PDT Lyberta wrote:
> Thiago Macieira:
> > On Friday, 6 September 2019 19:17:00 PDT Lyberta wrote:
> >> I think if the machine-readable output depends on locale, the author of
> >> the program seriously messed up.
> >
> > Oh, I agree with you. The problem is that the standard C library (as
> > extended by POSIX) does not provide the API to make that happen *and*
> > support internationalisation. And that's assuming the tool even has a
> > "machine readable" format in the first place. In the Unix tradition, you
> > just scrape the output of tools.
>
> Then don't use standard C library. On POSIX use open(), read() and
> write(), have your own Unicode layer on top and read/write UTF-8 JSON if
> you want to output anything machine-readable.

Challenge: produce this JSON thread-safely in machine-readable format in C,
with setlocale(LC_ALL, ""); at the top of the file, from input double v = 1.1.

 [ 1.1 ]
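
One possible answer, as a minimal sketch: it assumes the POSIX.1-2008
per-thread locale functions newlocale() and uselocale() are available, which
ISO C alone does not provide. The helper name format_json_double is made up
for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: format v as a one-element JSON array into buf.
   Returns the snprintf() result, or -1 if the "C" locale cannot be created. */
int format_json_double(char *buf, size_t size, double v)
{
    /* Build a private "C" locale object; the global locale is untouched. */
    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return -1;

    /* uselocale() changes the locale of the calling thread only. */
    locale_t old = uselocale(c_loc);

    /* "%g" matches the expected "1.1"; round-tripping arbitrary doubles
       would need more digits, e.g. "%.17g". */
    int n = snprintf(buf, size, "[ %g ]", v);

    uselocale(old);      /* restore this thread's previous locale */
    freelocale(c_loc);
    return n;
}
```

The decimal point comes from the thread's "C" locale, so the result should not
depend on what setlocale(LC_ALL, "") selected for the rest of the process.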


> There is no such thing as plain text and Unix philosophy is dead.

Great! Let's drop JSON then.

> I have a C++ proposal for binary IO/serialization here:
>
> https://github.com/Lyberta/cpp-io
>
> It was already reviewed by Niall twice and hopefully by C++23 we'll have
> sane binary IO in the standard. I don't have plans to fix C at this
> point though because it doesn't have an analog of std::byte yet.

That looks very nice, aside from the obligatory bike-shedding of the class and
namespace names, of course.

> > But the input is not Unicode, it's file paths. On Unix, it is possible to
> > pass binary input in the command-line. With some effort, you can even
> > pass NULs to specially crafted receiver applications. The std::filesystem
> > API appears to have a way to retrieve the native raw format, which some
> > application may need.
>
> Yeah, it's all because C decided to have char as both bytes and
> characters and doomed us all for ~50 years of pain. We need to decide if
> main() should get bytes or characters and fix it.

I think the decision was made for us: bytes on Unix, bytes on Windows if Niall
can get the runtimes to use UTF-8, wchar_t otherwise.
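
To illustrate the "bytes on Unix" case, here is a minimal sketch that treats
an argument as opaque bytes and hex-encodes it, so that an interchange format
never has to guess an encoding. The helper name hex_encode_arg and the choice
of hex are assumptions made for this example, not what P1689 specifies.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of arg as lowercase hex into out.
   Returns 0 on success, -1 if out is too small for 2 * len + 1 chars. */
int hex_encode_arg(const char *arg, char *out, size_t out_size)
{
    size_t len = strlen(arg);
    if (out_size < 2 * len + 1)
        return -1;

    for (size_t i = 0; i < len; i++) {
        /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
        snprintf(out + 2 * i, 3, "%02x", (unsigned char)arg[i]);
    }
    out[2 * len] = '\0';
    return 0;
}
```

A file name such as the three bytes 0xFF '/' 'a' is not valid UTF-8, yet it
passes through this helper unchanged in meaning.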

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-09-07 18:44:45