Date: Mon, 03 May 2021 08:57:18 -0600

Morning.

Thanks for your very detailed feedback. I will have a better look at

your comments when I have an available moment.

Thanks,

Richard

Quoting Oleksandr Koval <oleksandr.koval.dev_at_[hidden]>:

> Hi, I needed similar stats functions in my project so I tried to implement

> proposed free functions from P1708R4. I faced several

> problems/typos/questions, maybe my experience can be useful for the future

> development of this proposal.

> General things

>

> 1.

>

> Since C++20 constraint for ExecutionPolicy algorithms has been updated to

> std::is_execution_policy_v<std::remove_cvref_t<ExecutionPolicy>> instead

> of std::is_execution_policy_v<std::decay_t<ExecutionPolicy>>.

> 2.

>

> Shouldn't stats_error have const char*/const

> std::string&/string_view constructor

> to report what?

> 3.

>

> I believe that various(if any) throwing checks are redundant. Range size

> checks are cheap only for random access iterators. No existing algorithm

> throws when range(s) has bad size. The common case here is when values and

> weights ranges have different sizes, I think we can just leave it UB.

> Another one is when range has 0 or < N elements. In case of mean() we

> can just divide 0/0 and get inf for empty range. No exceptions are

> needed. Range value checks are expensive, since we're dealing

> with floating

> point types there's no problem with meaningless operations like

> square root

> of negative value.

> 4.

>

> Use std::conditional_t in function signatures instead of

> conditional<>::type.

>

>

> Iterator kind requirements

>

> 1.

>

> All existing algorithms with ExecutionPolicy support require forward

> iterator, not input iterator.

> 2.

>

> Review iterator requirements. For some functions it's very hard if

> possible at all to support input iterator. For example, unweighted

> sorted_quantile(r,

> q, proj, interpolation). All it needs is the size of the range to select

> proper element(s). For input iterator it needs to iterate over all

> elements just to determine range size, but it's not possible to select

> proper element(s) afterward because iterators are not valid anymore. The

> only option I can see here is to copy the whole range into vector, such

> an algorithm has O(n) space complexity.

>

> Mean

>

> 1. I think it would be nice to provide some overflow guarantees to

> encourage implementations to avoid it. For example, use the recurrence

> relation formula for mean(): mean(n) = mean(n-1) + (data[n] -

> mean(n-1))/(n+1). Maybe it's better to provide specific formulas for

> each function instead of general ones in the Part 3 of this proposal.

> Another option is to specify under what conditions overflow is allowed.

>

> Quantile

>

> 1.

>

> Make interpolation_t a scoped enum and remove trailing _t from

> enumerator names.

> 2.

>

> Typo, it still returns tuple instead of Result although interpolation is

> used.

> 3.

>

> Probably a typo, since interpolation is used, constraint for elements

> should be updated. Here's the fixed version of (1):

>

> template<typename R, typename P = std::identity, typename Q,

> // result_t is just an alias for `conditional_t<...>`

> typename Result = result_t<R, P>>// `stats_frange` is like

> `stats_range` but requires forward_iteratorrequires stats_frange<R, P>

> && std::is_floating_point_v<Q>constexpr auto sorted_quantile(R&& r, Q

> q, P proj = {},

> interpolation_t interpolation = interpolation_t::linear) -> Result;

>

>

> 1.

>

> Consider returning iterators instead of potential interpolation of pure

> values. Interpolation with existing modes(higher, lower, etc.) can be

> provided as a separate function. For example, std::max_element() is also

> supposed to be a numeric function, but it returns an iterator.

> One approach

> for this is to return std::pair<std::pair<range_iter, q_value>,

> std::pair<range_iter, q_value>> for unweighted versions and

> std::pair<std::pair<range_iter,

> weight_sum>, std::pair<range_iter, weight_sum>> for weighted ones. p.first

> == end means *no answer*, p.first == p.second means *single*

> answer, p.first

> != p.second means *interpolation* is needed.

> 2.

>

> Related to issue 6. How to write ExecutionPolicy version for an

> unweighted sorted quantile/median algorithm? All it does is finds range

> size and then reads one or two values, there are no parallel versions for

> std::ranges::distance() and std::ranges::advance().

> 3.

>

> Maybe it's obvious but specify that weighted quantiles can be >= 1.

> Because in section 3.2 there is 0 <= q <= 1 requirement.

> 4.

>

> Clarify unsorted_ versions. One option to implement them is in terms of

> nth_element() which mutates the original sequence. Another option is to

> use additional vector of indices and sort them, but it requires O(n)

> space.

> 5.

>

> What's the purpose of typename T = double in unweighted versions? It can

> be just typename T.

> 6.

>

> Why is there a version sorted_quantile(r, q, interpolation) without

> projection? I think that wherever we have any range, we should have

> projection for it as well, and versions without projections should be

> completely removed. The same for sorted_quantiles() et al., there should

> be projections for values, quantiles and weights ranges.

> 7.

>

> How about unsorted_quantiles()? When the number of quantiles is small,

> it can be more efficient than sorting and using sorted_ versions.

> 8.

>

> Clarify weighted sorted_ algorithms. In particular, the case when

> weights are zeros. For example, weighted median for weights

> {1,0,1,0} should

> be interpolation of 1st and 3rd elements. Clarify what should be done in

> the case when weight of the first element is already more than target

> quantile. I suppose that in such a case there's nothing to interpolate and

> the first element should be returned. Think about the complexity of the

> unsorted version of this approach. It might require up to N-1 additional

> calls to nth_element() in case when a lot of elements have zero weight.

>

> I haven't implemented the following functions, hence, not so much I can say

> for them, only minor/interface issues.

> Mode

>

> 1.

>

> Returning through the output iterator seems extremely inconvenient. What

> if the client needs only one mode or at most 3 modes? Maybe it will be

> better to provide 3 kinds of functions: return through output iterator,

> return by value, return through output range. In the last case, function

> can put more modes while range is not full.

> 2.

>

> Why does it require equality_comparable<iter_value<R>> if we need to

> compare projected<R,P>? Seems that the idea was to store iter_value in

> the hash map but it can be very inefficient. Why does it require

> std::output_iterator<O,

> std::iter_value_t<R>? Why does it output non-projected values? I think

> that it should either operate/output on projected values or on iterators

> (not pointed values!) to non-projected values. The latter will require at

> least forward iterator. The same applies to weighted version, why require

> std::equality_comparable<std::iter_value_t<Weights>>?

> 3.

>

> Why does (2, 4) require exactly std::unordered_map. Is it possible to

> allow a custom hash map and specify what operations it needs to support?

>

> Skewness

>

> 1. Make data_t a scoped enumeration. Choose a better name for it, remove

> trailing _t from enumerator names.

>

> Kurtosis

>

> 1. Make excess param a scoped enumeration instead of bool.

>

> Variance

>

> 1. Is var() really better than variance()?

>

>

> --

> Regards,

> Oleksandr Koval.

>

-----------------------------------------------------

This e-mail was sent via the Secure Web Server at the

University of Regina, Department of Computer Science

https://www.cs.uregina.ca/WebMail/

Thanks for your very detailed feedback. I will have a better look at

your comments when I have an available moment.

Thanks,

Richard

Quoting Oleksandr Koval <oleksandr.koval.dev_at_[hidden]>:

> Hi, I needed similar stats functions in my project so I tried to implement

> proposed free functions from P1708R4. I faced several

> problems/typos/questions, maybe my experience can be useful for the future

> development of this proposal.

> General things

>

> 1.

>

> Since C++20 constraint for ExecutionPolicy algorithms has been updated to

> std::is_execution_policy_v<std::remove_cvref_t<ExecutionPolicy>> instead

> of std::is_execution_policy_v<std::decay_t<ExecutionPolicy>>.

> 2.

>

> Shouldn't stats_error have const char*/const

> std::string&/string_view constructor

> to report what?

> 3.

>

> I believe that various(if any) throwing checks are redundant. Range size

> checks are cheap only for random access iterators. No existing algorithm

> throws when range(s) has bad size. The common case here is when values and

> weights ranges have different sizes, I think we can just leave it UB.

> Another one is when range has 0 or < N elements. In case of mean() we

> can just divide 0/0 and get inf for empty range. No exceptions are

> needed. Range value checks are expensive, since we're dealing

> with floating

> point types there's no problem with meaningless operations like

> square root

> of negative value.

> 4.

>

> Use std::conditional_t in function signatures instead of

> conditional<>::type.

>

>

> Iterator kind requirements

>

> 1.

>

> All existing algorithms with ExecutionPolicy support require forward

> iterator, not input iterator.

> 2.

>

> Review iterator requirements. For some functions it's very hard if

> possible at all to support input iterator. For example, unweighted

> sorted_quantile(r,

> q, proj, interpolation). All it needs is the size of the range to select

> proper element(s). For input iterator it needs to iterate over all

> elements just to determine range size, but it's not possible to select

> proper element(s) afterward because iterators are not valid anymore. The

> only option I can see here is to copy the whole range into vector, such

> an algorithm has O(n) space complexity.

>

> Mean

>

> 1. I think it would be nice to provide some overflow guarantees to

> encourage implementations to avoid it. For example, use the recurrence

> relation formula for mean(): mean(n) = mean(n-1) + (data[n] -

> mean(n-1))/(n+1). Maybe it's better to provide specific formulas for

> each function instead of general ones in the Part 3 of this proposal.

> Another option is to specify under what conditions overflow is allowed.

>

> Quantile

>

> 1.

>

> Make interpolation_t a scoped enum and remove trailing _t from

> enumerator names.

> 2.

>

> Typo, it still returns tuple instead of Result although interpolation is

> used.

> 3.

>

> Probably a typo, since interpolation is used, constraint for elements

> should be updated. Here's the fixed version of (1):

>

> template<typename R, typename P = std::identity, typename Q,

> // result_t is just an alias for `conditional_t<...>`

> typename Result = result_t<R, P>>// `stats_frange` is like

> `stats_range` but requires forward_iteratorrequires stats_frange<R, P>

> && std::is_floating_point_v<Q>constexpr auto sorted_quantile(R&& r, Q

> q, P proj = {},

> interpolation_t interpolation = interpolation_t::linear) -> Result;

>

>

> 1.

>

> Consider returning iterators instead of potential interpolation of pure

> values. Interpolation with existing modes(higher, lower, etc.) can be

> provided as a separate function. For example, std::max_element() is also

> supposed to be a numeric function, but it returns an iterator.

> One approach

> for this is to return std::pair<std::pair<range_iter, q_value>,

> std::pair<range_iter, q_value>> for unweighted versions and

> std::pair<std::pair<range_iter,

> weight_sum>, std::pair<range_iter, weight_sum>> for weighted ones. p.first

> == end means *no answer*, p.first == p.second means *single*

> answer, p.first

> != p.second means *interpolation* is needed.

> 2.

>

> Related to issue 6. How to write ExecutionPolicy version for an

> unweighted sorted quantile/median algorithm? All it does is finds range

> size and then reads one or two values, there are no parallel versions for

> std::ranges::distance() and std::ranges::advance().

> 3.

>

> Maybe it's obvious but specify that weighted quantiles can be >= 1.

> Because in section 3.2 there is 0 <= q <= 1 requirement.

> 4.

>

> Clarify unsorted_ versions. One option to implement them is in terms of

> nth_element() which mutates the original sequence. Another option is to

> use additional vector of indices and sort them, but it requires O(n)

> space.

> 5.

>

> What's the purpose of typename T = double in unweighted versions? It can

> be just typename T.

> 6.

>

> Why is there a version sorted_quantile(r, q, interpolation) without

> projection? I think that wherever we have any range, we should have

> projection for it as well, and versions without projections should be

> completely removed. The same for sorted_quantiles() et al., there should

> be projections for values, quantiles and weights ranges.

> 7.

>

> How about unsorted_quantiles()? When the number of quantiles is small,

> it can be more efficient than sorting and using sorted_ versions.

> 8.

>

> Clarify weighted sorted_ algorithms. In particular, the case when

> weights are zeros. For example, weighted median for weights

> {1,0,1,0} should

> be interpolation of 1st and 3rd elements. Clarify what should be done in

> the case when weight of the first element is already more than target

> quantile. I suppose that in such a case there's nothing to interpolate and

> the first element should be returned. Think about the complexity of the

> unsorted version of this approach. It might require up to N-1 additional

> calls to nth_element() in case when a lot of elements have zero weight.

>

> I haven't implemented the following functions, hence, not so much I can say

> for them, only minor/interface issues.

> Mode

>

> 1.

>

> Returning through the output iterator seems extremely inconvenient. What

> if the client needs only one mode or at most 3 modes? Maybe it will be

> better to provide 3 kinds of functions: return through output iterator,

> return by value, return through output range. In the last case, function

> can put more modes while range is not full.

> 2.

>

> Why does it require equality_comparable<iter_value<R>> if we need to

> compare projected<R,P>? Seems that the idea was to store iter_value in

> the hash map but it can be very inefficient. Why does it require

> std::output_iterator<O,

> std::iter_value_t<R>? Why does it output non-projected values? I think

> that it should either operate/output on projected values or on iterators

> (not pointed values!) to non-projected values. The latter will require at

> least forward iterator. The same applies to weighted version, why require

> std::equality_comparable<std::iter_value_t<Weights>>?

> 3.

>

> Why does (2, 4) require exactly std::unordered_map. Is it possible to

> allow a custom hash map and specify what operations it needs to support?

>

> Skewness

>

> 1. Make data_t a scoped enumeration. Choose a better name for it, remove

> trailing _t from enumerator names.

>

> Kurtosis

>

> 1. Make excess param a scoped enumeration instead of bool.

>

> Variance

>

> 1. Is var() really better than variance()?

>

>

> --

> Regards,

> Oleksandr Koval.

>

-----------------------------------------------------

This e-mail was sent via the Secure Web Server at the

University of Regina, Department of Computer Science

https://www.cs.uregina.ca/WebMail/

Received on 2021-05-03 09:57:48