Minutes

On Wed, Apr 7, 2021 at 10:21 AM Michael Wong <fraggamuffin@gmail.com> wrote:

SG19 Machine Learning 2 hours. This session will focus on Stats and Combinatorics but with updates from all the others optionally.


Hi,

Michael Wong is inviting you to a scheduled Zoom meeting.

Topic: SG19 monthly Dec 2020-Feb 2021
Time: Apr 8, 2020 02:00 PM Eastern Time (US and Canada) Stats
    Every month on the Second Thu, 
    May 13, 2020 02:00 PM ET 1900 UTC Reinformaent Learning and Diff Calculus
    June 10, 2021 02:00 PM ET 1900 UTC Graph
    Jul 8, 2021 02:00 PM ET 1900 UTC Stats and Combinatorics
    Please download and import the following iCalendar (.ics) files to your
calendar system.
    Monthly:
https://iso.zoom.us/meeting/tJctf-2tpzotGNHL5pZqwtjELee0mcG2zzCi/ics?icsToken=98tyKuCrrjMuH92UtxuCRowqAoqgLO_xmH5ajY11sEr1OTFEdgnTGudHYr98N4rK

Join from PC, Mac, Linux, iOS or Android:
https://iso.zoom.us/j/93084591725?pwd=K3QxZjJlcnljaE13ZWU5cTlLNkx0Zz09
    Password: 035530

Or iPhone one-tap :
    US: +13017158592,,93084591725# or +13126266799,,93084591725#
Or Telephone:
    Dial(for higher quality, dial a number based on your current location):
        US: +1 301 715 8592 or +1 312 626 6799 or +1 346 248 7799 or +1
408 638 0968 or +1 646 876 9923 or +1 669 900 6833 or +1 253 215 8782
 or 877 853 5247 (Toll Free)
    Meeting ID: 930 8459 1725
    Password: 035530
    International numbers available: https://iso.zoom.us/u/agewu4X97

Or Skype for Business (Lync):
    https://iso.zoom.us/skype/93084591725

Agenda:

1. Opening and introductions

The ISO Code of conduct:
https://www.iso.org/files/live/sites/isoorg/files/store/en/PUB100397.pdf
The IEC Code of Conduct:
https://basecamp.iec.ch/download/iec-code-of-conduct-for-delegates-and-experts/

ISO patent policy.
https://isotc.iso.org/livelink/livelink/fetch/2000/2122/3770791/Common_Policy.htm?nodeid=6344764&vernum=-2

The WG21 Practices and Procedures and Code of Conduct:
https://isocpp.org/std/standing-documents/sd-4-wg21-practices-and-procedures

1.1 Roll call of participants

Andrew Lumsdaine, Guy Davidson, Johan Lundberg, Ozran Irsoy, Phil Ratzloff, Scott McMillan, Scott Moe, Will Wray, Michael Wong , Jens Maurer, Rene Rivera, Cyril Khazan,

1.2 Adopt agenda

1.3 Approve minutes from previous meeting, and approve publishing
 previously approved minutes to ISOCPP.org

1.4 Action items from previous meetings

2. Main issues (125 min)

2.1 General logistics

Meeting plan, focus on one paper per meeting but does not preclude other
paper updates:

    May 13, 2020 02:00 PM ET 1900 UTC Reinformaent Learning and Diff Calculus
    June 10, 2021 02:00 PM ET 1900 UTC Graph
    Jul 8, 2021 02:00 PM ET 1900 UTC Stats and Combinatorics

ISO meeting status

future C++ Std meetings

2.2 Paper reviews

2.2.1: ML topics

2.2.1.1 Graph Proposal Phil Ratsloff et al

P1709R1: Graph Proposal for Machine Learning

P1709R3:
https://docs.google.com/document/d/1kLHhbSTX7j0tPeTYECQFSNx3R35Mu3xO5_dyYdRy4dM/edit?usp=sharing

https://docs.google.com/document/d/1QkfDzGyfNQKs86y053M0YHOLP6frzhTJqzg1Ug_vkkE/edit?usp=sharing

<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2119r0.html>

<https://docs.google.com/document/d/175wIm8o4BNGti0WLq8U6uZORegKVjmnpfc-_E8PoGS0/edit?ts=5fff27cd#heading=h.9ogkehmdmtel>


customization point with multiple parameters, how to do that
Eric Niebler for help?


2.2.1.2 Reinforcement Learning Larry Lewis Jorge Silva

Reinforcement Learning proposal:

2.2.1.3 Differential Calculus:

https://docs.google.com/document/d/175wIm8o4BNGti0WLq8U6uZORegKVjmnpfc-_E8PoGS0/edit?ts=5fff27cd#heading=h.9ogkehmdmtel


2.2.1.4: Stats paper

Current github

https://github.com/cplusplus/papers/issues/475  

https://github.com/cplusplus/papers/issues/979


Review by Johan:
> I have four comments
> 1.
> Relating to* " Like the weighted quantile,this feature would require that
> the values of a given range either be presorted or sorted as part of the
> computation of a mean. "*
>
> There's no need to sort a whole range to do a trimmed mean or weighted
> median. For the first, it's enough to do two calls to nth_element (fast)
> and benefit from that it partitions out the outliers without need for full
> sorting. Perhaps there's even an algorithm - eg a
> generalization/reformulation of select/introsect that does it faster than
> those two calls (would be interesting to know). In any case it's better
> than full sorting.
>
> I agree that trimmed mean should ideally be done by combining mean with
> general purpose algorithms (like nth element) but I don't know how to do
> that with ranges (I wouldn't - I'm new to them).
>
> There's also a more efficient way to do weighted median without sorting.
> For that I think a specific algorithm would be very useful because it's not
> possible to create it out of composing other existing and proposed
> algorithms as far as I understand. An O(N) solution is hinted to exist at
> c. below
> https://en.wikipedia.org/wiki/Weighted_median#cite_note-:0-1
>
> [image: bild.png]
>
replace sort with use_n element

can we do harmonic mean in a single pass? yes there is a way
accumulator objects can do the trick, these do not calculate median/quantize which need intermediate calculation with unbounded memory
Johan concerned that this may not be best way to do quantized, they dont say anything about numerical stability, this may be unusable in some use cases
1. accumulate have execution policy, so can execute in multiple threads, simd has no order guarantee, (this means result may be unreproducible)
C++ does not guarantee order of calculation
Use Box-Muller transform instead of Gaussian transform
to get control of order of intermediate calculations, then need to roll your own, or impl can build a more stable version so its QOI

Order of complexity mentioned in the paper (should be mostly linear)

>
> 2.
> Another general point (obvious but worth considering in the design or
> discussing in the proposal):* numerical stability can be an issue* with
> many of the defining equations being far from best practice when
> implementing. Eg,
> https://dbs.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf
>
> It would be good to say something regarding that in the proposal. Perhaps
> closely related to point 3 below.
>
> Again, I trust you are more knowledgeable than me, but even summation is
> sensitive to the order of the values. Mean of (doubles) mean( eps, 30, -30)
> vs mean(30, -30, eps) and similar with higher modes and equations.
>
> So, it would be good if the specification of the methods allow the reader
> to understand how to deal with that. For example, if the sum is  
> *specified *to
> be added from the beginning to end to a value of the same type without
> extra variables to deal with numerics, the user could arrange the values in
> an order (for example sort on absolute value) to reduce the effects.
>
same as above
> 3.
> When the statistical distributions got into the standard they were not that
> tightly specified, so we got a few unnecessary things like this (they just
> differ in the output *order* of box-muller)
> https://stackoverflow.com/questions/38532927/why-gcc-and-
> *msvc-stdnormal-distribution-are-different*
>
> It would be good to specify a bit more where it does not hinder the
> implementation. The above seems a bit unnecessary but it's not an easy
> trade-of vs faster or more accurate vs predictable and "same".
>
> 4.
> Standard Deviation and variance
>
> I find it's important to make it possible to specify ddof as is possible
> with numpy <https://numpy.org/doc/stable/reference/generated/numpy.std.html>.
>
>
> Doing that, it's possible to get
> (uncorrected) sample standard deviation (ddof = 0.0 )   your equation 15
> (corrected) sample standard deviation (ddof = 1.0)   your equation 16
> (approximated [wikipedia]
> <https://en.wikipedia.org/wiki/Standard_deviation#Unbiased_sample_standard_deviation>)
> Unbiased sample standard deviation (ddof = 1.5)
seems useful by generalizing Data_T parameter
using the last equation in the above wikipedia article
possibly as a compile time option
also variance and std dev (like numpy)

>
> Naturally, the same with variance (again, as in numpy). sqrt is not free
> even these days.
>
>   cheers, Johan Lundberg https://www.linkedin.com/in/johanml/
>
> PS. I'm not sure how to do nth_element piped into mean using ranges. But
> just to clarify what I meant in comment 1. without ranges:
> #include <vector>
> #include <algorithm>
> #include <iostream>
> int main(){
>     std::vector<int> v{11,10,7,6,3,1,5,2,4,8,9};
>     auto L=v.begin()+2;
>     auto H=v.end()-2;
>     std::nth_element(v.begin(), L, v.end());
>     std::nth_element(L, H, v.end());
>     for(; L!=H ; ++L){
>         std::cout << *L << " "; // 5 8 4 6 7 3 9
>     }
> }
>
> 5.4 Trimmed Mean
> The issue of a trimmed mean is raised in [41]. A (p%)trimmed mean[42] is
> one in which each of thep/2% highestandlowestvalues (of a sorted range) are
> excluded from the computation of that mean. Like the weighted quantile,this
> feature would require that the values of a given range either be presorted or
> sorted as part of the computation of a mean. As an author, Phillip Ratzloff
> feels (a sentiment that was echoed by the author of [41]) that one might
> handle this (and other similar) matter via ranges, specifically by using a
> statement of the form auto result = values | std::ranges::sort | trim(p) |
> std::stats::mean

 Jeff Garland comments:
1. rolling stats and sliding window over the data and do that with accumulator based version instead of recalculating it over and over again
this is not in scope for this proposal, but for future
2. stats error
stats error derive from std exception, dont throw std exception directly
should define class stats_error and show a constructor, what string
3. number type limitations
limited to C++ data type but need custom types, arithmetic types lock out other types
flexibility vs safety
more types enable more flexibility
general types need to say what kind of operations does it support, what the implementation look like
dont forget sq root, or kurtosis
worth looking at it
also consider harmonic mean and trigonometric functions,
associativity commutativity, exec parameter does not guarantee order, not even double and float
JM recommend dont try in first attempt unless there is market pressure
arithmetic allows strong typedef/typeoff to work
serving the user and the implementers (who need to know what ops allowed to call on user supplied type)
can we use arithmetic now, and later relax it?
its just the constraint that is being relaxed, so that more stuff may work in future
make a concept for these things,
but keep the arithmetic constraint


review of Combinatorics D paper
use natural logarithm by Moses/MIT
double type will max out after 127! so its not enough, we use templates for wide integer type

plot value vs accuracy, linear declining
how many digits in the result?

do we need a factorial class which are often divided?
Matlab allows factorials of 20million but is it a limitation of the algorithm?
is there a concept of T? like std library functions that have no concept

Please add JM and SG19 as co-author or acknowledgment to both papers

back to stats:
quantile sorted, unsorted was the most difficult, while mean , median was easy
but unsorted could use n-element
can use on any reference
85% quantile symmetric way
how to round q with the number element? add extra param like numpy to say whether you want both
with unsorted that might add new unknown algorithm








Stats review Richard Dosselman et al

P1708R3: Math proposal for Machine Learning: 3rd review

PXXXX: combinatorics: 1st Review

> std.org/jtc1/sc22/wg21/docs/papers/2020/p1708r2
> above is the stats paper that was reviewed in Prague
http://wiki.edg.com/bin/view/Wg21prague/P1708R2SG19
>
> Review Jolanta Polish feedback.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2119r0.html

2.2.3 any other proposal for reviews?

2.3 Other Papers and proposals

P1416R1: SG19 - Linear Algebra for Data Science and Machine Learning
https://docs.google.com/document/d/1IKUNiUhBgRURW-UkspK7fAAyIhfXuMxjk7xKikK4Yp8/edit#heading=h.tj9hitg7dbtr

P1415: Machine Learning Layered list
https://docs.google.com/document/d/1elNFdIXWoetbxjO1OKol_Wj8fyi4Z4hogfj5tLVSj64/edit#heading=h.tj9hitg7dbtr

2.2.2 SG14 Linear Algebra progress:
Different layers of proposal
https://docs.google.com/document/d/1poXfr7mUPovJC9ZQ5SDVM_1Nb6oYAXlK_d0ljdUAtSQ/edit

2.5 Future F2F meetings:

2.6 future C++ Standard meetings:
https://isocpp.org/std/meetings-and-participation/upcoming-meetings

None

3. Any other business

New reflector

http://lists.isocpp.org/mailman/listinfo.cgi/sg19

Old Reflector
https://groups.google.com/a/isocpp.org/forum/#!newtopic/sg19
<https://groups.google.com/a/isocpp.org/forum/?fromgroups=#!forum/sg14>

Code and proposal Staging area

4. Review

4.1 Review and approve resolutions and issues [e.g., changes to SG's
working draft]

4.2 Review action items (5 min)

5. Closing process

5.1 Establish next agenda

TBD

5.2 Future meeting

    April 8 2021 02:00 PM Eastern Time ( 1800 UTC ): Stats