Minutes

On Wed, Apr 7, 2021 at 10:21 AM Michael Wong <fraggamuffin@gmail.com> wrote:

SG19 Machine Learning, 2 hours. This session will focus on Stats and Combinatorics, with optional updates on the other topics.

Hi,

Michael Wong is inviting you to a scheduled Zoom meeting.

Topic: SG19 monthly Dec 2020-Feb 2021

Time: Apr 8, 2021 02:00 PM Eastern Time (US and Canada): Stats

Every month on the second Thursday,

May 13, 2021 02:00 PM ET 1900 UTC Reinforcement Learning and Differential Calculus

June 10, 2021 02:00 PM ET 1900 UTC Graph

Jul 8, 2021 02:00 PM ET 1900 UTC Stats and Combinatorics

Please download and import the following iCalendar (.ics) files to your

calendar system.

Monthly:

https://iso.zoom.us/meeting/tJctf-2tpzotGNHL5pZqwtjELee0mcG2zzCi/ics?icsToken=98tyKuCrrjMuH92UtxuCRowqAoqgLO_xmH5ajY11sEr1OTFEdgnTGudHYr98N4rK

Join from PC, Mac, Linux, iOS or Android:

https://iso.zoom.us/j/93084591725?pwd=K3QxZjJlcnljaE13ZWU5cTlLNkx0Zz09

Password: 035530

Or iPhone one-tap:

US: +13017158592,,93084591725# or +13126266799,,93084591725#

Or Telephone:

Dial(for higher quality, dial a number based on your current location):

US: +1 301 715 8592 or +1 312 626 6799 or +1 346 248 7799 or +1

408 638 0968 or +1 646 876 9923 or +1 669 900 6833 or +1 253 215 8782

or 877 853 5247 (Toll Free)

Meeting ID: 930 8459 1725

Password: 035530

International numbers available: https://iso.zoom.us/u/agewu4X97

Or Skype for Business (Lync):

https://iso.zoom.us/skype/93084591725

Agenda:

1. Opening and introductions

The ISO Code of Conduct:

https://www.iso.org/files/live/sites/isoorg/files/store/en/PUB100397.pdf

The IEC Code of Conduct:

https://basecamp.iec.ch/download/iec-code-of-conduct-for-delegates-and-experts/

The ISO patent policy:

https://isotc.iso.org/livelink/livelink/fetch/2000/2122/3770791/Common_Policy.htm?nodeid=6344764&vernum=-2

The WG21 Practices and Procedures and Code of Conduct:

https://isocpp.org/std/standing-documents/sd-4-wg21-practices-and-procedures

1.1 Roll call of participants

Andrew Lumsdaine, Guy Davidson, Johan Lundberg, Ozan Irsoy, Phil Ratzloff, Scott McMillan, Scott Moe, Will Wray, Michael Wong, Jens Maurer, Rene Rivera, Cyril Khazan

1.2 Adopt agenda

1.3 Approve minutes from previous meeting, and approve publishing

previously approved minutes to ISOCPP.org

1.4 Action items from previous meetings

2. Main issues (125 min)

2.1 General logistics

Meeting plan: focus on one paper per meeting, but this does not preclude other

paper updates:

May 13, 2021 02:00 PM ET 1900 UTC Reinforcement Learning and Differential Calculus

June 10, 2021 02:00 PM ET 1900 UTC Graph

Jul 8, 2021 02:00 PM ET 1900 UTC Stats and Combinatorics

ISO meeting status

future C++ Std meetings

2.2 Paper reviews

2.2.1: ML topics

2.2.1.1 Graph Proposal Phil Ratzloff et al

P1709R1: Graph Proposal for Machine Learning

P1709R3:

https://docs.google.com/document/d/1kLHhbSTX7j0tPeTYECQFSNx3R35Mu3xO5_dyYdRy4dM/edit?usp=sharing

https://docs.google.com/document/d/1QkfDzGyfNQKs86y053M0YHOLP6frzhTJqzg1Ug_vkkE/edit?usp=sharing

<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2119r0.html>

customization point with multiple parameters: how to do that?

Ask Eric Niebler for help?

2.2.1.2 Reinforcement Learning Larry Lewis Jorge Silva

Reinforcement Learning proposal:

2.2.1.3 Differential Calculus:

2.2.1.4: Stats paper

Current github

Review by Johan:

> I have four comments

> 1.

> Relating to: "Like the weighted quantile, this feature would require that

> the values of a given range either be presorted or sorted as part of the

> computation of a mean."

>

> There's no need to sort a whole range to do a trimmed mean or weighted

> median. For the first, it's enough to do two calls to nth_element (fast)

> and benefit from that it partitions out the outliers without need for full

> sorting. Perhaps there's even an algorithm - eg a

> generalization/reformulation of select/introselect that does it faster than

> those two calls (would be interesting to know). In any case it's better

> than full sorting.

>

> I agree that trimmed mean should ideally be done by combining mean with

> general purpose algorithms (like nth element) but I don't know how to do

> that with ranges (I wouldn't - I'm new to them).

>

> There's also a more efficient way to do weighted median without sorting.

> For that I think a specific algorithm would be very useful because it's not

> possible to create it out of composing other existing and proposed

> algorithms as far as I understand. An O(N) solution is hinted to exist at

> c. below

> https://en.wikipedia.org/wiki/Weighted_median#cite_note-:0-1

>

> [image: bild.png]

>

replace sort with nth_element

can we do harmonic mean in a single pass? Yes, there is a way.

accumulator objects can do the trick; these do not calculate median/quantile, which need intermediate calculation with unbounded memory

Johan is concerned that this may not be the best way to do quantiles; the paper doesn't say anything about numerical stability, so this may be unusable in some use cases

1. accumulate has an execution policy, so it can execute in multiple threads; SIMD has no order guarantee (this means the result may be unreproducible)

C++ does not guarantee order of calculation

Use Box-Muller transform instead of Gaussian transform

to get control of the order of intermediate calculations you need to roll your own, or the implementation can build a more stable version, so it's QoI (quality of implementation)

Order of complexity mentioned in the paper (should be mostly linear)

>

> 2.

> Another general point (obvious but worth considering in the design or

> discussing in the proposal):* numerical stability can be an issue* with

> many of the defining equations being far from best practice when

> implementing. Eg,

> https://dbs.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf

>

> It would be good to say something regarding that in the proposal. Perhaps

> closely related to point 3 below.

>

> Again, I trust you are more knowledgeable than me, but even summation is

> sensitive to the order of the values. Mean of (doubles) mean( eps, 30, -30)

> vs mean(30, -30, eps) and similar with higher modes and equations.

>

> So, it would be good if the specification of the methods allow the reader

> to understand how to deal with that. For example, if the sum is

> *specified *to

> be added from the beginning to end to a value of the same type without

> extra variables to deal with numerics, the user could arrange the values in

> an order (for example sort on absolute value) to reduce the effects.


>

same as above

> 3.

> When the statistical distributions got into the standard they were not that

> tightly specified, so we got a few unnecessary things like this (they just

> differ in the output *order* of box-muller)

> https://stackoverflow.com/questions/38532927/why-gcc-and-msvc-stdnormal-distribution-are-different

>

> It would be good to specify a bit more where it does not hinder the

> implementation. The above seems a bit unnecessary but it's not an easy

> trade-off vs faster or more accurate vs predictable and "same".

>

> 4.

> Standard Deviation and variance

>

> I find it's important to make it possible to specify ddof as is possible

> with numpy <https://numpy.org/doc/stable/reference/generated/numpy.std.html>.

>

>

> Doing that, it's possible to get

> (uncorrected) sample standard deviation (ddof = 0.0 ) your equation 15

> (corrected) sample standard deviation (ddof = 1.0) your equation 16

> (approximated [wikipedia]

> <https://en.wikipedia.org/wiki/Standard_deviation#Unbiased_sample_standard_deviation>)

> Unbiased sample standard deviation (ddof = 1.5)


seems useful; could be done by generalizing the Data_T parameter

using the last equation in the above wikipedia article

possibly as a compile time option

also variance and std dev (like numpy)

>

> Naturally, the same with variance (again, as in numpy). sqrt is not free

> even these days.


>

> cheers, Johan Lundberg https://www.linkedin.com/in/johanml/

>

> PS. I'm not sure how to do nth_element piped into mean using ranges. But

> just to clarify what I meant in comment 1. without ranges:

> #include <vector>
> #include <algorithm>
> #include <iostream>
>
> int main(){
>     std::vector<int> v{11,10,7,6,3,1,5,2,4,8,9};
>     auto L = v.begin()+2;
>     auto H = v.end()-2;
>     std::nth_element(v.begin(), L, v.end());
>     std::nth_element(L, H, v.end());
>     for(; L != H; ++L){
>         std::cout << *L << " "; // 5 8 4 6 7 3 9
>     }
> }

>

> 5.4 Trimmed Mean

> The issue of a trimmed mean is raised in [41]. A (p%) trimmed mean [42] is

> one in which each of the p/2% highest and lowest values (of a sorted range) are

> excluded from the computation of that mean. Like the weighted quantile, this

> feature would require that the values of a given range either be presorted or

> sorted as part of the computation of a mean. As an author, Phillip Ratzloff

> feels (a sentiment that was echoed by the author of [41]) that one might

> handle this (and other similar) matter via ranges, specifically by using a

> statement of the form auto result = values | std::ranges::sort | trim(p) |

> std::stats::mean


Jeff Garland comments:

1. rolling stats: slide a window over the data and do it with an accumulator-based version instead of recalculating over and over again

this is not in scope for this proposal, but for future work

2. stats error

stats errors should derive from a std exception type; don't throw std::exception directly

should define class stats_error and show a constructor; what string does it take?

3. number type limitations

limited to built-in C++ arithmetic types, but users need custom types; constraining to arithmetic types locks out other types

flexibility vs safety

more types enable more flexibility

general types need to say what kind of operations they support and what the implementation looks like

don't forget square root, or kurtosis

worth looking at it

also consider harmonic mean and trigonometric functions,

associativity/commutativity: the execution-policy parameter does not guarantee order, not even for double and float

JM recommends: don't try it in the first attempt unless there is market pressure

arithmetic allows strong typedef/typeof to work

serving the user and the implementers (who need to know what operations they are allowed to call on a user-supplied type)

can we use arithmetic now, and later relax it?

it's just the constraint that is being relaxed, so more stuff may work in the future

make a concept for these things,

but keep the arithmetic constraint

review of Combinatorics D paper

use natural logarithm by Moses/MIT

the double type will max out after 127!, so it's not enough; we use templates for a wide integer type

plot value vs accuracy, linear declining

how many digits in the result?

do we need a factorial class, since factorials are often divided?

Matlab allows factorials of 20 million, but is it a limitation of the algorithm?

is there a concept for T? (like std library functions that have no concept)

Please add JM and SG19 as co-authors or in the acknowledgments of both papers

back to stats:

quantile (sorted and unsorted) was the most difficult, while mean and median were easy

but unsorted could use nth_element

can use on any reference

85% quantile symmetric way

how to round q with the number of elements? add an extra parameter like numpy to say whether you want both

with unsorted, that might add a new unknown algorithm

Stats review Richard Dosselman et al

P1708R3: Math proposal for Machine Learning: 3rd review

PXXXX: combinatorics: 1st Review

> std.org/jtc1/sc22/wg21/docs/papers/2020/p1708r2

> above is the stats paper that was reviewed in Prague

> http://wiki.edg.com/bin/view/Wg21prague/P1708R2SG19

>

> Review Jolanta Polish feedback.

> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2119r0.html

2.2.3 any other proposal for reviews?

2.3 Other Papers and proposals

P1416R1: SG19 - Linear Algebra for Data Science and Machine Learning

https://docs.google.com/document/d/1IKUNiUhBgRURW-UkspK7fAAyIhfXuMxjk7xKikK4Yp8/edit#heading=h.tj9hitg7dbtr

P1415: Machine Learning Layered list

https://docs.google.com/document/d/1elNFdIXWoetbxjO1OKol_Wj8fyi4Z4hogfj5tLVSj64/edit#heading=h.tj9hitg7dbtr

2.2.2 SG14 Linear Algebra progress:

Different layers of proposal

https://docs.google.com/document/d/1poXfr7mUPovJC9ZQ5SDVM_1Nb6oYAXlK_d0ljdUAtSQ/edit

2.5 Future F2F meetings:

2.6 future C++ Standard meetings:

https://isocpp.org/std/meetings-and-participation/upcoming-meetings

None

3. Any other business

New reflector

http://lists.isocpp.org/mailman/listinfo.cgi/sg19

Old Reflector

https://groups.google.com/a/isocpp.org/forum/#!newtopic/sg19

<https://groups.google.com/a/isocpp.org/forum/?fromgroups=#!forum/sg14>

Code and proposal Staging area

4. Review

4.1 Review and approve resolutions and issues [e.g., changes to SG's

working draft]

4.2 Review action items (5 min)

5. Closing process

5.1 Establish next agenda

TBD

5.2 Future meeting

April 8 2021 02:00 PM Eastern Time (1800 UTC): Stats