Date: Wed, 18 Apr 2018 15:26:18 -0400
When discussing the char8_t proposal [1] in EWG at Jacksonville,
concerns were raised regarding our long term direction. The following
was raised as one hypothetical future we might find ourselves in:
We adopt char8_t and the community migrates towards use of u8"",
u8string, u8string_view, etc. for portable handling of UTF-8. At some
point, all relevant compilers migrate to use of UTF-8 as the execution
character encoding, but char retains its current aliasing behavior. We
now have two ways of writing portable UTF-8 based code. At this point,
use of char may be preferred to avoid having to sprinkle 'u8'
everywhere. However, use of char8_t may be preferred for performance
advantages due to its non-aliasing behavior.
The claim that non-aliasing behavior produces (significantly) better
performance seems reasonable, but I'm wondering if we can quantify it in
some reasonable way. My brief searches for papers or benchmarks failed
to identify prior research. If anyone knows of studies that have been
done, I would appreciate a pointer to them, particularly if they have a
focus on string/text processing.
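To make the aliasing concern concrete, here is a minimal sketch of the kind of loop where it could matter. The names utf8_unit, counter, fill_char, and fill_u8 are hypothetical and exist only for illustration. Because char may alias any object, a store through a char* might modify c->n, so the compiler generally has to reload the loop bound on every iteration; with a distinct non-aliasing type it is free to load the bound once. On compilers without char8_t the sketch falls back to an enum stand-in, which is an assumption on my part and may or may not be treated as non-aliasing by a given compiler.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;          // distinct type with no aliasing privilege
#else
enum utf8_unit : unsigned char {};  // hypothetical stand-in for older compilers
#endif

struct counter {
    std::size_t n;  // loop bound, read through a pointer on purpose
};

// char version: each store to dst[i] may alias c->n, so the bound
// generally has to be reloaded from memory on every iteration.
void fill_char(char* dst, const counter* c, char v) {
    assert(c != nullptr);
    for (std::size_t i = 0; i < c->n; ++i)
        dst[i] = v;
}

// Non-aliasing version: a store to a utf8_unit cannot legally modify a
// std::size_t, so the compiler may hoist the load of c->n out of the loop.
void fill_u8(utf8_unit* dst, const counter* c, utf8_unit v) {
    assert(c != nullptr);
    for (std::size_t i = 0; i < c->n; ++i)
        dst[i] = v;
}
```

Both functions produce the same result; any difference would only show up in the generated code, for example by comparing the assembly for the two loops at -O2. Whether that difference is measurable in realistic text processing is exactly the open question.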
The GCC fork available in the char8_t branch of the GitHub repo at [2]
has an implementation of char8_t that is non-aliasing. We could use
this to conduct some experiments intended to quantify performance
differences. One idea I had was to modify Zach Laine's text library [3]
to support char8_t and then compare the performance of test runs, but
I'm not very familiar with his tests. Another idea was to work with Bob
Steagall to profile his UTF-8 work built with char8_t support; Bob has
been specifically focused on performance benchmarks, so this could be
fruitful. Ideally though, realistic results would best be obtained
with code that does both text-intensive and non-text-intensive processing
since, presumably, the non-text processing would benefit from the use of
non-aliasing types on the text processing side.
If anyone has suggestions for other experiments to try, I'd like to hear
them.
Tom.
[1]: http://wg21.link/p0482
[2]: https://github.com/tahonermann/gcc/tree/char8_t
[3]: https://github.com/tzlaine/text
Received on 2018-04-18 21:26:26