Re: [SG16] Fast UTF-8 sequence validation

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 19 May 2021 10:34:33 -0400
Copying SG16 (the ISO WG21 C++ standard study group on Unicode and text
processing).

Tom.

On 5/18/21 7:20 PM, Nelson H. F. Beebe via Unicode wrote:
> I recently recorded a BibTeX entry in
>
> http://www.math.utah.edu/pub/tex/bib/unicode.html#Keiser:2021:VUL
> for a new paper that has just been published in a Wiley journal:
>
> Validating UTF-8 in less than one instruction per byte
> Software --- Practice and Experience 51(5) 950--964 May 2021
> https://doi.org/10.1002/spe.2920
>
> A preprint is available at
>
> https://arxiv.org/abs/2010.03090
>
> The authors exploit vector instructions in recent AMD/Intel x86_64 and
> ARM v7 NEON processors to achieve high throughput. For mostly ASCII
> sequences, that throughput in some cases exceeds that of the Standard C
> library function memcpy(); for random UTF-8 sequences, it is 1/4 to 1/2
> the speed of memcpy().
>
> C++ code implementing their work is freely available at
>
> https://github.com/lemire/validateutf8-experiments
>
> and the paper's references contain links to earlier papers on fast
> validation and transformation of Unicode character sequences.
>
> -------------------------------------------------------------------------------
> - Nelson H. F. Beebe Tel: +1 801 581 5254 -
> - University of Utah FAX: +1 801 581 4148 -
> - Department of Mathematics, 110 LCB Internet e-mail: beebe_at_[hidden] -
> - 155 S 1400 E RM 233 beebe_at_[hidden] beebe_at_[hidden] -
> - Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ -
> -------------------------------------------------------------------------------
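
For readers who want a concrete picture of what "validation" has to check, here is a minimal scalar sketch in C++. It is not the vectorized algorithm from the paper and is not taken from the linked repository. It is only a plain reference checker under my own assumptions: the function name is_valid_utf8 is hypothetical, and the 8-byte ASCII fast path is a simple word-at-a-time trick, not the SIMD technique described above. It rejects stray or missing continuation bytes, truncated sequences, overlong encodings, surrogates (U+D800..U+DFFF), and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical helper (name chosen for illustration only).
// Returns true if `s` is a well-formed UTF-8 byte sequence.
bool is_valid_utf8(std::string_view s) {
    const auto* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;

    while (i < n) {
        // ASCII fast path: if none of the next 8 bytes has its high bit
        // set, all 8 are ASCII and can be skipped at once.
        if (i + 8 <= n) {
            std::uint64_t w;
            std::memcpy(&w, p + i, sizeof w);
            if ((w & 0x8080808080808080ULL) == 0) {
                i += 8;
                continue;
            }
        }

        const unsigned char b = p[i];
        if (b < 0x80) {  // single ASCII byte
            ++i;
            continue;
        }

        std::size_t len;         // total bytes in this sequence
        std::uint32_t cp;        // code point being decoded
        std::uint32_t min_cp;    // smallest code point allowed for `len`
        if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F; min_cp = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min_cp = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07; min_cp = 0x10000;
        } else {
            return false;  // stray continuation byte, or 0xF8..0xFF
        }

        if (n - i < len) return false;  // sequence truncated at end of input

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate

        i += len;
    }
    return true;
}
```

A call such as is_valid_utf8("\xE2\x82\xAC") (the three-byte encoding of U+20AC) should return true, while is_valid_utf8("\xC0\x80") (an overlong encoding of U+0000) should return false. A checker of this shape handles non-ASCII input one byte at a time, which is the cost the vectorized approach in the paper is meant to avoid.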



Received on 2021-05-19 09:34:41