sg19: Re: [SG19] Apr 8 SG19 Zoom

From: Michael Wong <fraggamuffin_at_[hidden]>
Date: Thu, 8 Apr 2021 15:50:05 -0400

Minutes

On Wed, Apr 7, 2021 at 10:21 AM Michael Wong <fraggamuffin_at_[hidden]> wrote:

> SG19 Machine Learning 2 hours. This session will focus on Stats and
> Combinatorics but with updates from all the others optionally.
>
>
> Hi,
>
> Michael Wong is inviting you to a scheduled Zoom meeting.
>
> Topic: SG19 monthly Dec 2020-Feb 2021
> Time: Apr 8, 2021 02:00 PM Eastern Time (US and Canada) Stats
> Every month on the Second Thu,
> May 13, 2021 02:00 PM ET 1900 UTC Reinforcement Learning and Diff
> Calculus
> June 10, 2021 02:00 PM ET 1900 UTC Graph
> Jul 8, 2021 02:00 PM ET 1900 UTC Stats and Combinatorics
> Please download and import the following iCalendar (.ics) files to your
> calendar system.
> Monthly:
>
> https://iso.zoom.us/meeting/tJctf-2tpzotGNHL5pZqwtjELee0mcG2zzCi/ics?icsToken=98tyKuCrrjMuH92UtxuCRowqAoqgLO_xmH5ajY11sEr1OTFEdgnTGudHYr98N4rK
>
> Join from PC, Mac, Linux, iOS or Android:
> https://iso.zoom.us/j/93084591725?pwd=K3QxZjJlcnljaE13ZWU5cTlLNkx0Zz09
> Password: 035530
>
> Or iPhone one-tap :
> US: +13017158592,,93084591725# or +13126266799,,93084591725#
> Or Telephone:
> Dial(for higher quality, dial a number based on your current location):
> US: +1 301 715 8592 or +1 312 626 6799 or +1 346 248 7799 or +1
> 408 638 0968 or +1 646 876 9923 or +1 669 900 6833 or +1 253 215 8782
> or 877 853 5247 (Toll Free)
> Meeting ID: 930 8459 1725
> Password: 035530
> International numbers available: https://iso.zoom.us/u/agewu4X97
>
> Or Skype for Business (Lync):
> https://iso.zoom.us/skype/93084591725
>
> Agenda:
>
> 1. Opening and introductions
>
> The ISO Code of conduct:
> https://www.iso.org/files/live/sites/isoorg/files/store/en/PUB100397.pdf
> The IEC Code of Conduct:
>
> https://basecamp.iec.ch/download/iec-code-of-conduct-for-delegates-and-experts/
>
> ISO patent policy.
>
> https://isotc.iso.org/livelink/livelink/fetch/2000/2122/3770791/Common_Policy.htm?nodeid=6344764&vernum=-2
>
> The WG21 Practices and Procedures and Code of Conduct:
>
> https://isocpp.org/std/standing-documents/sd-4-wg21-practices-and-procedures
>
> 1.1 Roll call of participants
>
Andrew Lumsdaine, Guy Davidson, Johan Lundberg, Ozran Irsoy, Phil Ratzloff,
Scott McMillan, Scott Moe, Will Wray, Michael Wong, Jens Maurer, Rene
Rivera, Cyril Khazan

1.2 Adopt agenda
>
> 1.3 Approve minutes from previous meeting, and approve publishing
> previously approved minutes to ISOCPP.org
>
> 1.4 Action items from previous meetings
>
> 2. Main issues (125 min)
>
> 2.1 General logistics
>
> Meeting plan, focus on one paper per meeting but does not preclude other
> paper updates:
>
> May 13, 2021 02:00 PM ET 1900 UTC Reinforcement Learning and Diff
> Calculus
> June 10, 2021 02:00 PM ET 1900 UTC Graph
> Jul 8, 2021 02:00 PM ET 1900 UTC Stats and Combinatorics
>
> ISO meeting status
>
> future C++ Std meetings
>
> 2.2 Paper reviews
>
> 2.2.1: ML topics
>
> 2.2.1.1 Graph Proposal Phil Ratzloff et al
>
> P1709R1: Graph Proposal for Machine Learning
>
> P1709R3:
>
> https://docs.google.com/document/d/1kLHhbSTX7j0tPeTYECQFSNx3R35Mu3xO5_dyYdRy4dM/edit?usp=sharing
>
>
> https://docs.google.com/document/d/1QkfDzGyfNQKs86y053M0YHOLP6frzhTJqzg1Ug_vkkE/edit?usp=sharing
>
> <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2119r0.html>
>
> <
> https://docs.google.com/document/d/175wIm8o4BNGti0WLq8U6uZORegKVjmnpfc-_E8PoGS0/edit?ts=5fff27cd#heading=h.9ogkehmdmtel
> >
>

Open question: how to write a customization point that takes multiple
parameters. Ask Eric Niebler for help?

2.2.1.2 Reinforcement Learning Larry Lewis Jorge Silva
>
> Reinforcement Learning proposal:
>
> 2.2.1.3 Differential Calculus:
>
>
> https://docs.google.com/document/d/175wIm8o4BNGti0WLq8U6uZORegKVjmnpfc-_E8PoGS0/edit?ts=5fff27cd#heading=h.9ogkehmdmtel
>
>
> 2.2.1.4: Stats paper
>
> Current github
>
> https://github.com/cplusplus/papers/issues/475
>
> https://github.com/cplusplus/papers/issues/979
>

Review by Johan:
> I have four comments
> 1.
> Relating to *"Like the weighted quantile, this feature would require that
> the values of a given range either be presorted or sorted as part of the
> computation of a mean."*
>
> There's no need to sort a whole range to do a trimmed mean or weighted
> median. For the first, it's enough to do two calls to nth_element (fast)
> and benefit from that it partitions out the outliers without need for full
> sorting. Perhaps there's even an algorithm - e.g. a
> generalization/reformulation of select/introselect that does it faster than
> those two calls (would be interesting to know). In any case it's better
> than full sorting.
>
> I agree that trimmed mean should ideally be done by combining mean with
> general purpose algorithms (like nth element) but I don't know how to do
> that with ranges (I wouldn't - I'm new to them).
>
> There's also a more efficient way to do weighted median without sorting.
> For that I think a specific algorithm would be very useful because it's not
> possible to create it out of composing other existing and proposed
> algorithms as far as I understand. An O(N) solution is hinted to exist at
> c. below
> https://en.wikipedia.org/wiki/Weighted_median#cite_note-:0-1
>
> [image: bild.png]
>
Replace the sort with std::nth_element.
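
Johan's first comment proposes two calls to nth_element in place of a full sort. A minimal sketch of a trimmed mean built that way follows. It assumes the caller passes the number of elements to drop at each end rather than a percentage, and trimmed_mean is a hypothetical name, not one from the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Mean of the values that remain after dropping the `trim` smallest and the
// `trim` largest elements. Takes the vector by value because nth_element
// reorders it.
double trimmed_mean(std::vector<double> v, std::size_t trim) {
    assert(v.size() > 2 * trim);
    auto low = v.begin() + static_cast<std::ptrdiff_t>(trim);
    auto high = v.end() - static_cast<std::ptrdiff_t>(trim);
    // After this call everything before `low` is <= everything from `low` on.
    std::nth_element(v.begin(), low, v.end());
    // Partition the remainder so that [high, end) holds the `trim` largest.
    if (high != v.end()) std::nth_element(low, high, v.end());
    // [low, high) now holds the kept values in unspecified order.
    const double sum = std::accumulate(low, high, 0.0);
    return sum / static_cast<double>(high - low);
}
```

With the eleven values from Johan's PS and trim = 2, the kept values are 3 through 9 and the result is 6. Both nth_element calls are linear on average, so no full sort is needed.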

Can we do harmonic mean in a single pass? Yes, there is a way:
accumulator objects can do the trick. These do not calculate
median/quantiles, which need intermediate calculation with unbounded memory.
Johan is concerned that this may not be the best way to do quantiles; they
don't say anything about numerical stability, so this may be unusable in
some use cases.
1. Accumulation with an execution policy (std::reduce; std::accumulate takes
none) can execute in multiple threads, and SIMD has no order guarantee
(this means the result may be unreproducible).
C++ does not guarantee the order of calculation.
Use Box-Muller transform instead of Gaussian transform.
To get control of the order of intermediate calculations you need to roll
your own, or an implementation can build a more stable version, so it's QOI.

Order of complexity mentioned in the paper (should be mostly linear)
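
The single-pass harmonic mean mentioned in the notes above can be illustrated with a small accumulator object. This is a sketch only: harmonic_mean_accumulator is a hypothetical name, and it assumes strictly positive double inputs and makes no attempt at compensated summation.

```cpp
#include <cassert>
#include <cstddef>

// Feed values one at a time and ask for the result at any point. Memory use
// is constant, unlike median or quantiles.
class harmonic_mean_accumulator {
public:
    void operator()(double x) {
        assert(x > 0.0);  // this sketch handles positive values only
        reciprocal_sum_ += 1.0 / x;
        ++count_;
    }

    // Harmonic mean of everything seen so far: N / sum(1 / x_i).
    double value() const {
        assert(count_ > 0);
        return static_cast<double>(count_) / reciprocal_sum_;
    }

private:
    double reciprocal_sum_ = 0.0;
    std::size_t count_ = 0;
};
```

Feeding 1, 4 and 4 gives a reciprocal sum of 1.5 and a harmonic mean of 2. Because the reciprocals are added in arrival order, the stability and reproducibility concerns recorded above apply to this sketch as well.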

>
> 2.
> Another general point (obvious but worth considering in the design or
> discussing in the proposal):* numerical stability can be an issue* with
> many of the defining equations being far from best practice when
> implementing. Eg,
>
https://dbs.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf
>
> It would be good to say something regarding that in the proposal. Perhaps
> closely related to point 3 below.
>
> Again, I trust you are more knowledgeable than me, but even summation is
> sensitive to the order of the values. Mean of (doubles) mean(eps, 30, -30)
> vs mean(30, -30, eps) and similar with higher modes and equations.
>
> So, it would be good if the specification of the methods allows the reader
> to understand how to deal with that. For example, if the sum is *specified*
> to be added from the beginning to end to a value of the same type without
> extra variables to deal with numerics, the user could arrange the values in
> an order (for example sort on absolute value) to reduce the effects.
>
same as above
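
Johan's mean(eps, 30, -30) versus mean(30, -30, eps) example can be reproduced with a plain left-to-right sum. The sketch below assumes IEEE 754 double arithmetic without extended intermediate precision, and the function names are hypothetical.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Plain left-to-right summation into a double, with no compensation.
double sum_in_order(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Mean built on that sum; the result therefore depends on element order.
double mean_in_order(const std::vector<double>& v) {
    assert(!v.empty());
    return sum_in_order(v) / static_cast<double>(v.size());
}
```

Taking eps = 1e-16 as an illustrative value, mean_in_order({eps, 30, -30}) is exactly 0, because eps is lost when it is added to 30, while mean_in_order({30, -30, eps}) is eps / 3. A specification that fixes the summation order would let users choose the better arrangement themselves.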
> 3.
> When the statistical distributions got into the standard they were not that
> tightly specified, so we got a few unnecessary things like this (they just
> differ in the output *order* of Box-Muller)
> https://stackoverflow.com/questions/38532927/why-gcc-and-msvc-stdnormal-distribution-are-different
>
> It would be good to specify a bit more where it does not hinder the
> implementation. The above seems a bit unnecessary but it's not an easy
> trade-off vs faster or more accurate vs predictable and "same".
>
> 4.
> Standard Deviation and variance
>
> I find it's important to make it possible to specify ddof as is possible
> with numpy <https://numpy.org/doc/stable/reference/generated/numpy.std.html>.
>
>
> Doing that, it's possible to get
> (uncorrected) sample standard deviation (ddof = 0.0), your equation 15
> (corrected) sample standard deviation (ddof = 1.0), your equation 16
> (approximated [wikipedia]
> <https://en.wikipedia.org/wiki/Standard_deviation#Unbiased_sample_standard_deviation>)
> unbiased sample standard deviation (ddof = 1.5)
Seems useful, by generalizing the Data_T parameter,
using the last equation in the above Wikipedia article,
possibly as a compile-time option.
Also variance and std dev (like numpy).
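
The ddof idea in comment 4 can be sketched as a run-time parameter on both variance and standard deviation. This is an illustration under assumptions: the function names and the two-pass formulation are hypothetical and are not taken from P1708, and a compile-time variant would be equally possible.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Two-pass variance with a numpy-style `ddof` (delta degrees of freedom):
// the sum of squared deviations is divided by (N - ddof).
double variance(const std::vector<double>& v, double ddof = 0.0) {
    const double n = static_cast<double>(v.size());
    assert(n > ddof);
    const double mean = std::accumulate(v.begin(), v.end(), 0.0) / n;
    double squared_deviations = 0.0;
    for (double x : v) squared_deviations += (x - mean) * (x - mean);
    return squared_deviations / (n - ddof);
}

// Offered separately so that callers who only need the variance do not pay
// for the square root.
double standard_deviation(const std::vector<double>& v, double ddof = 0.0) {
    return std::sqrt(variance(v, ddof));
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the squared deviations sum to 32, so ddof = 0 gives a variance of 4 and a standard deviation of 2, while ddof = 1 gives a variance of 32/7.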

>
> Naturally, the same with variance (again, as in numpy). sqrt is not free
> even these days.
>
> cheers, Johan Lundberg https://www.linkedin.com/in/johanml/
>
> PS. I'm not sure how to do nth_element piped into mean using ranges. But
> just to clarify what I meant in comment 1. without ranges:
> #include <algorithm>
> #include <iostream>
> #include <vector>
> int main() {
>     std::vector<int> v{11, 10, 7, 6, 3, 1, 5, 2, 4, 8, 9};
>     auto L = v.begin() + 2;  // drop the 2 smallest values
>     auto H = v.end() - 2;    // drop the 2 largest values
>     std::nth_element(v.begin(), L, v.end());  // [begin, L) holds the 2 smallest
>     std::nth_element(L, H, v.end());          // [H, end) holds the 2 largest
>     for (; L != H; ++L) {
>         // the values 3..9 in unspecified order, e.g. 5 8 4 6 7 3 9
>         std::cout << *L << " ";
>     }
> }
>
> 5.4 Trimmed Mean
> The issue of a trimmed mean is raised in [41]. A (p%) trimmed mean [42] is
> one in which each of the p/2% highest and lowest values (of a sorted range)
> are excluded from the computation of that mean. Like the weighted quantile,
> this feature would require that the values of a given range either be
> presorted or sorted as part of the computation of a mean. As an author,
> Phillip Ratzloff feels (a sentiment that was echoed by the author of [41])
> that one might handle this (and other similar) matter via ranges,
> specifically by using a statement of the form
> auto result = values | std::ranges::sort | trim(p) | std::stats::mean

Jeff Garland comments:
1. Rolling stats: slide a window over the data and do that with an
accumulator-based version instead of recalculating it over and over again.
This is not in scope for this proposal, but is for the future.
2. Stats error
stats_error should derive from std::exception; don't throw std::exception
directly. Should define class stats_error and show a constructor and what
string it takes.
3. Number type limitations
Limited to C++ data types, but custom types are needed; requiring arithmetic
types locks out other types.
Flexibility vs safety:
more types enable more flexibility.
General types need to say what kind of operations the type supports and what
the implementation looks like.
Don't forget square root, or kurtosis.
Worth looking at.
Also consider harmonic mean and trigonometric functions.
Associativity and commutativity: the exec parameter does not guarantee
order, not even for double and float.
JM recommends not trying this in the first attempt unless there is market
pressure.
Arithmetic allows strong typedef/typeoff to work.
Serving the user and the implementers (who need to know what ops they are
allowed to call on a user-supplied type).
Can we use arithmetic now, and later relax it?
It's just the constraint that is being relaxed, so that more stuff may work
in the future.
Make a concept for these things,
but keep the arithmetic constraint.
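
Items 2 and 3 above can be sketched together: an error class in the std::exception hierarchy, and a single named constraint that could be relaxed later without touching the algorithms. Everything here is hypothetical (stats_error's constructor, stats_value_v, and this mean signature are not from the paper). The constraint is written as a C++17 trait so the sketch builds without concepts support; in C++20 the same condition could be spelled as a named concept, as the notes suggest.

```cpp
#include <cassert>
#include <numeric>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// Error type for the stats functions: part of the std::exception hierarchy
// (via std::runtime_error) and constructed from a message string.
class stats_error : public std::runtime_error {
public:
    explicit stats_error(const std::string& what_arg)
        : std::runtime_error(what_arg) {}
};

// The constraint lives in one named trait, so relaxing it later (for example
// to admit user-defined number types) means changing only this definition.
template <class T>
inline constexpr bool stats_value_v = std::is_arithmetic_v<T>;

// Mean over arithmetic element types only; reports an empty input with
// stats_error instead of throwing std::exception directly.
template <class T, class = std::enable_if_t<stats_value_v<T>>>
double mean(const std::vector<T>& values) {
    if (values.empty()) throw stats_error("mean: empty range");
    const double sum = std::accumulate(values.begin(), values.end(), 0.0);
    return sum / static_cast<double>(values.size());
}
```

Calling mean on a std::vector<int> holding 1, 2, 3, 4 returns 2.5, an empty vector raises stats_error, and a vector of a non-arithmetic type is rejected at compile time.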

review of Combinatorics D paper
use natural logarithm by Moses/MIT
The double type will max out after 127!, so it's not enough; we use
templates for a wide integer type.
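
The natural-logarithm approach noted above can be illustrated with std::lgamma, which returns ln(n!) when given n + 1. This is a sketch under assumptions: log_factorial and binomial are hypothetical names, results are approximate doubles, and this is not the wide-integer template design described for the paper.

```cpp
#include <cassert>
#include <cmath>

// ln(n!) computed as lgamma(n + 1). The logarithm stays finite long after
// n! itself overflows a double.
double log_factorial(unsigned n) {
    return std::lgamma(static_cast<double>(n) + 1.0);
}

// Binomial coefficient C(n, k) from a difference of logarithms. The rounding
// recovers the exact integer only while the floating-point error stays
// below 0.5, that is, for moderately sized results.
double binomial(unsigned n, unsigned k) {
    assert(k <= n);
    return std::round(std::exp(log_factorial(n) - log_factorial(k) -
                               log_factorial(n - k)));
}
```

Working in logarithms also addresses the point that factorials are often divided: the division becomes a subtraction, and no huge intermediate value is formed.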

Plot value vs accuracy: linearly declining.
How many digits are in the result?

Do we need a factorial class, since factorials are often divided?
Matlab allows factorials of 20 million, but is it a limitation of the
algorithm?
Is there a concept for T? Like std library functions that have no concept.

Please add JM and SG19 as co-author or acknowledgment to both papers

back to stats:
Quantile (sorted, unsorted) was the most difficult, while mean and median
were easy,
but unsorted could use std::nth_element.
Can use on any reference.
85% quantile, symmetric way.
How to round q against the number of elements? Add an extra param like numpy
to say whether you want both.
With unsorted, that might add a new unknown algorithm.
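
A quantile on unsorted data with an extra rounding parameter, as discussed above, might look like the following sketch. The enum, the function name and the choice of q * (N - 1) as the position are assumptions made for illustration, loosely modelled on the "lower" and "higher" options of numpy's quantile; they are not the paper's interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Which element to pick when q * (N - 1) falls between two indices.
enum class quantile_method { lower, higher };

// Quantile of unsorted data without a full sort. Takes the vector by value
// because nth_element reorders it.
double quantile(std::vector<double> v, double q, quantile_method method) {
    assert(!v.empty() && 0.0 <= q && q <= 1.0);
    const double position = q * static_cast<double>(v.size() - 1);
    const double index = method == quantile_method::lower
                             ? std::floor(position)
                             : std::ceil(position);
    const auto nth = v.begin() + static_cast<std::ptrdiff_t>(index);
    std::nth_element(v.begin(), nth, v.end());  // linear on average
    return *nth;
}
```

For the values 5, 1, 4, 2, 3 and q = 0.85 the position is 3.4, so the lower method returns 4 and the higher method returns 5. Returning both neighbours, or interpolating between them, would need a second selection step.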

> Stats review Richard Dosselmann et al
>
> P1708R3: Math proposal for Machine Learning: 3rd review
>
> PXXXX: combinatorics: 1st Review
>
> > std.org/jtc1/sc22/wg21/docs/papers/2020/p1708r2
> > above is the stats paper that was reviewed in Prague
> > http://wiki.edg.com/bin/view/Wg21prague/P1708R2SG19
> >
> > Review Jolanta Polish feedback.
> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2119r0.html
>
> 2.2.3 any other proposal for reviews?
>
> 2.3 Other Papers and proposals
>
> P1416R1: SG19 - Linear Algebra for Data Science and Machine Learning
>
> https://docs.google.com/document/d/1IKUNiUhBgRURW-UkspK7fAAyIhfXuMxjk7xKikK4Yp8/edit#heading=h.tj9hitg7dbtr
>
> P1415: Machine Learning Layered list
>
> https://docs.google.com/document/d/1elNFdIXWoetbxjO1OKol_Wj8fyi4Z4hogfj5tLVSj64/edit#heading=h.tj9hitg7dbtr
>
> 2.2.2 SG14 Linear Algebra progress:
> Different layers of proposal
>
> https://docs.google.com/document/d/1poXfr7mUPovJC9ZQ5SDVM_1Nb6oYAXlK_d0ljdUAtSQ/edit
>
> 2.5 Future F2F meetings:
>
> 2.6 future C++ Standard meetings:
> https://isocpp.org/std/meetings-and-participation/upcoming-meetings
>
> None
>
> 3. Any other business
>
> New reflector
>
> http://lists.isocpp.org/mailman/listinfo.cgi/sg19
>
> Old Reflector
> https://groups.google.com/a/isocpp.org/forum/#!newtopic/sg19
> <https://groups.google.com/a/isocpp.org/forum/?fromgroups=#!forum/sg14>
>
> Code and proposal Staging area
>
> 4. Review
>
> 4.1 Review and approve resolutions and issues [e.g., changes to SG's
> working draft]
>
> 4.2 Review action items (5 min)
>
> 5. Closing process
>
> 5.1 Establish next agenda
>
> TBD
>
> 5.2 Future meeting
>
> April 8 2021 02:00 PM Eastern Time ( 1800 UTC ): Stats
>

Received on 2021-04-08 14:50:31