Date: Mon, 03 May 2021 08:57:18 -0600
Morning.
Thanks for your very detailed feedback. I will have a better look at
your comments when I have an available moment.
Thanks,
Richard
Quoting Oleksandr Koval <oleksandr.koval.dev_at_[hidden]>:
> Hi, I needed similar stats functions in my project so I tried to implement
> proposed free functions from P1708R4. I faced several
> problems/typos/questions, maybe my experience can be useful for the future
> development of this proposal.
> General things
>
> 1.
>
> Since C++20 constraint for ExecutionPolicy algorithms has been updated to
> std::is_execution_policy_v<std::remove_cvref_t<ExecutionPolicy>> instead
> of std::is_execution_policy_v<std::decay_t<ExecutionPolicy>>.
> 2.
>
> Shouldn't stats_error have const char*/const
> std::string&/string_view constructor
> to report what?
> 3.
>
> I believe that various(if any) throwing checks are redundant. Range size
> checks are cheap only for random access iterators. No existing algorithm
> throws when range(s) has bad size. The common case here is when values and
> weights ranges have different sizes, I think we can just leave it UB.
> Another one is when range has 0 or < N elements. In case of mean() we
> can just divide 0/0 and get inf for empty range. No exceptions are
> needed. Range value checks are expensive, since we're dealing
> with floating
> point types there's no problem with meaningless operations like
> square root
> of negative value.
> 4.
>
> Use std::conditional_t in function signatures instead of
> conditional<>::type.
>
>
> Iterator kind requirements
>
> 1.
>
> All existing algorithms with ExecutionPolicy support require forward
> iterator, not input iterator.
> 2.
>
> Review iterator requirements. For some functions it's very hard if
> possible at all to support input iterator. For example, unweighted
> sorted_quantile(r,
> q, proj, interpolation). All it needs is the size of the range to select
> proper element(s). For input iterator it needs to iterate over all
> elements just to determine range size, but it's not possible to select
> proper element(s) afterward because iterators are not valid anymore. The
> only option I can see here is to copy the whole range into vector, such
> an algorithm has O(n) space complexity.
>
> Mean
>
> 1. I think it would be nice to provide some overflow guarantees to
> encourage implementations to avoid it. For example, use the recurrence
> relation formula for mean(): mean(n) = mean(n-1) + (data[n] -
> mean(n-1))/(n+1). Maybe it's better to provide specific formulas for
> each function instead of general ones in the Part 3 of this proposal.
> Another option is to specify under what conditions overflow is allowed.
>
> Quantile
>
> 1.
>
> Make interpolation_t a scoped enum and remove trailing _t from
> enumerator names.
> 2.
>
> Typo, it still returns tuple instead of Result although interpolation is
> used.
> 3.
>
> Probably a typo, since interpolation is used, constraint for elements
> should be updated. Here's the fixed version of (1):
>
> template<typename R, typename P = std::identity, typename Q,
> // result_t is just an alias for `conditional_t<...>`
> typename Result = result_t<R, P>>// `stats_frange` is like
> `stats_range` but requires forward_iteratorrequires stats_frange<R, P>
> && std::is_floating_point_v<Q>constexpr auto sorted_quantile(R&& r, Q
> q, P proj = {},
> interpolation_t interpolation = interpolation_t::linear) -> Result;
>
>
> 1.
>
> Consider returning iterators instead of potential interpolation of pure
> values. Interpolation with existing modes(higher, lower, etc.) can be
> provided as a separate function. For example, std::max_element() is also
> supposed to be a numeric function, but it returns an iterator.
> One approach
> for this is to return std::pair<std::pair<range_iter, q_value>,
> std::pair<range_iter, q_value>> for unweighted versions and
> std::pair<std::pair<range_iter,
> weight_sum>, std::pair<range_iter, weight_sum>> for weighted ones. p.first
> == end means *no answer*, p.first == p.second means *single*
> answer, p.first
> != p.second means *interpolation* is needed.
> 2.
>
> Related to issue 6. How to write ExecutionPolicy version for an
> unweighted sorted quantile/median algorithm? All it does is finds range
> size and then reads one or two values, there are no parallel versions for
> std::ranges::distance() and std::ranges::advance().
> 3.
>
> Maybe it's obvious but specify that weighted quantiles can be >= 1.
> Because in section 3.2 there is 0 <= q <= 1 requirement.
> 4.
>
> Clarify unsorted_ versions. One option to implement them is in terms of
> nth_element() which mutates the original sequence. Another option is to
> use additional vector of indices and sort them, but it requires O(n)
> space.
> 5.
>
> What's the purpose of typename T = double in unweighted versions? It can
> be just typename T.
> 6.
>
> Why is there a version sorted_quantile(r, q, interpolation) without
> projection? I think that wherever we have any range, we should have
> projection for it as well, and versions without projections should be
> completely removed. The same for sorted_quantiles() et al., there should
> be projections for values, quantiles and weights ranges.
> 7.
>
> How about unsorted_quantiles()? When the number of quantiles is small,
> it can be more efficient than sorting and using sorted_ versions.
> 8.
>
> Clarify weighted sorted_ algorithms. In particular, the case when
> weights are zeros. For example, weighted median for weights
> {1,0,1,0} should
> be interpolation of 1st and 3rd elements. Clarify what should be done in
> the case when weight of the first element is already more than target
> quantile. I suppose that in such a case there's nothing to interpolate and
> the first element should be returned. Think about the complexity of the
> unsorted version of this approach. It might require up to N-1 additional
> calls to nth_element() in case when a lot of elements have zero weight.
>
> I haven't implemented the following functions, hence, not so much I can say
> for them, only minor/interface issues.
> Mode
>
> 1.
>
> Returning through the output iterator seems extremely inconvenient. What
> if the client needs only one mode or at most 3 modes? Maybe it will be
> better to provide 3 kinds of functions: return through output iterator,
> return by value, return through output range. In the last case, function
> can put more modes while range is not full.
> 2.
>
> Why does it require equality_comparable<iter_value<R>> if we need to
> compare projected<R,P>? Seems that the idea was to store iter_value in
> the hash map but it can be very inefficient. Why does it require
> std::output_iterator<O,
> std::iter_value_t<R>? Why does it output non-projected values? I think
> that it should either operate/output on projected values or on iterators
> (not pointed values!) to non-projected values. The latter will require at
> least forward iterator. The same applies to weighted version, why require
> std::equality_comparable<std::iter_value_t<Weights>>?
> 3.
>
> Why does (2, 4) require exactly std::unordered_map. Is it possible to
> allow a custom hash map and specify what operations it needs to support?
>
> Skewness
>
> 1. Make data_t a scoped enumeration. Choose a better name for it, remove
> trailing _t from enumerator names.
>
> Kurtosis
>
> 1. Make excess param a scoped enumeration instead of bool.
>
> Variance
>
> 1. Is var() really better than variance()?
>
>
> --
> Regards,
> Oleksandr Koval.
>
-----------------------------------------------------
This e-mail was sent via the Secure Web Server at the
University of Regina, Department of Computer Science
https://www.cs.uregina.ca/WebMail/
Thanks for your very detailed feedback. I will have a better look at
your comments when I have an available moment.
Thanks,
Richard
Quoting Oleksandr Koval <oleksandr.koval.dev_at_[hidden]>:
> Hi, I needed similar stats functions in my project so I tried to implement
> proposed free functions from P1708R4. I faced several
> problems/typos/questions, maybe my experience can be useful for the future
> development of this proposal.
> General things
>
> 1.
>
> Since C++20 constraint for ExecutionPolicy algorithms has been updated to
> std::is_execution_policy_v<std::remove_cvref_t<ExecutionPolicy>> instead
> of std::is_execution_policy_v<std::decay_t<ExecutionPolicy>>.
> 2.
>
> Shouldn't stats_error have const char*/const
> std::string&/string_view constructor
> to report what?
> 3.
>
> I believe that various(if any) throwing checks are redundant. Range size
> checks are cheap only for random access iterators. No existing algorithm
> throws when range(s) has bad size. The common case here is when values and
> weights ranges have different sizes, I think we can just leave it UB.
> Another one is when range has 0 or < N elements. In case of mean() we
> can just divide 0/0 and get inf for empty range. No exceptions are
> needed. Range value checks are expensive, since we're dealing
> with floating
> point types there's no problem with meaningless operations like
> square root
> of negative value.
> 4.
>
> Use std::conditional_t in function signatures instead of
> conditional<>::type.
>
>
> Iterator kind requirements
>
> 1.
>
> All existing algorithms with ExecutionPolicy support require forward
> iterator, not input iterator.
> 2.
>
> Review iterator requirements. For some functions it's very hard if
> possible at all to support input iterator. For example, unweighted
> sorted_quantile(r,
> q, proj, interpolation). All it needs is the size of the range to select
> proper element(s). For input iterator it needs to iterate over all
> elements just to determine range size, but it's not possible to select
> proper element(s) afterward because iterators are not valid anymore. The
> only option I can see here is to copy the whole range into vector, such
> an algorithm has O(n) space complexity.
>
> Mean
>
> 1. I think it would be nice to provide some overflow guarantees to
> encourage implementations to avoid it. For example, use the recurrence
> relation formula for mean(): mean(n) = mean(n-1) + (data[n] -
> mean(n-1))/(n+1). Maybe it's better to provide specific formulas for
> each function instead of general ones in the Part 3 of this proposal.
> Another option is to specify under what conditions overflow is allowed.
>
> Quantile
>
> 1.
>
> Make interpolation_t a scoped enum and remove trailing _t from
> enumerator names.
> 2.
>
> Typo, it still returns tuple instead of Result although interpolation is
> used.
> 3.
>
> Probably a typo, since interpolation is used, constraint for elements
> should be updated. Here's the fixed version of (1):
>
> template<typename R, typename P = std::identity, typename Q,
> // result_t is just an alias for `conditional_t<...>`
> typename Result = result_t<R, P>>// `stats_frange` is like
> `stats_range` but requires forward_iteratorrequires stats_frange<R, P>
> && std::is_floating_point_v<Q>constexpr auto sorted_quantile(R&& r, Q
> q, P proj = {},
> interpolation_t interpolation = interpolation_t::linear) -> Result;
>
>
> 1.
>
> Consider returning iterators instead of potential interpolation of pure
> values. Interpolation with existing modes(higher, lower, etc.) can be
> provided as a separate function. For example, std::max_element() is also
> supposed to be a numeric function, but it returns an iterator.
> One approach
> for this is to return std::pair<std::pair<range_iter, q_value>,
> std::pair<range_iter, q_value>> for unweighted versions and
> std::pair<std::pair<range_iter,
> weight_sum>, std::pair<range_iter, weight_sum>> for weighted ones. p.first
> == end means *no answer*, p.first == p.second means *single*
> answer, p.first
> != p.second means *interpolation* is needed.
> 2.
>
> Related to issue 6. How to write ExecutionPolicy version for an
> unweighted sorted quantile/median algorithm? All it does is finds range
> size and then reads one or two values, there are no parallel versions for
> std::ranges::distance() and std::ranges::advance().
> 3.
>
> Maybe it's obvious but specify that weighted quantiles can be >= 1.
> Because in section 3.2 there is 0 <= q <= 1 requirement.
> 4.
>
> Clarify unsorted_ versions. One option to implement them is in terms of
> nth_element() which mutates the original sequence. Another option is to
> use additional vector of indices and sort them, but it requires O(n)
> space.
> 5.
>
> What's the purpose of typename T = double in unweighted versions? It can
> be just typename T.
> 6.
>
> Why is there a version sorted_quantile(r, q, interpolation) without
> projection? I think that wherever we have any range, we should have
> projection for it as well, and versions without projections should be
> completely removed. The same for sorted_quantiles() et al., there should
> be projections for values, quantiles and weights ranges.
> 7.
>
> How about unsorted_quantiles()? When the number of quantiles is small,
> it can be more efficient than sorting and using sorted_ versions.
> 8.
>
> Clarify weighted sorted_ algorithms. In particular, the case when
> weights are zeros. For example, weighted median for weights
> {1,0,1,0} should
> be interpolation of 1st and 3rd elements. Clarify what should be done in
> the case when weight of the first element is already more than target
> quantile. I suppose that in such a case there's nothing to interpolate and
> the first element should be returned. Think about the complexity of the
> unsorted version of this approach. It might require up to N-1 additional
> calls to nth_element() in case when a lot of elements have zero weight.
>
> I haven't implemented the following functions, hence, not so much I can say
> for them, only minor/interface issues.
> Mode
>
> 1.
>
> Returning through the output iterator seems extremely inconvenient. What
> if the client needs only one mode or at most 3 modes? Maybe it will be
> better to provide 3 kinds of functions: return through output iterator,
> return by value, return through output range. In the last case, function
> can put more modes while range is not full.
> 2.
>
> Why does it require equality_comparable<iter_value<R>> if we need to
> compare projected<R,P>? Seems that the idea was to store iter_value in
> the hash map but it can be very inefficient. Why does it require
> std::output_iterator<O,
> std::iter_value_t<R>? Why does it output non-projected values? I think
> that it should either operate/output on projected values or on iterators
> (not pointed values!) to non-projected values. The latter will require at
> least forward iterator. The same applies to weighted version, why require
> std::equality_comparable<std::iter_value_t<Weights>>?
> 3.
>
> Why does (2, 4) require exactly std::unordered_map. Is it possible to
> allow a custom hash map and specify what operations it needs to support?
>
> Skewness
>
> 1. Make data_t a scoped enumeration. Choose a better name for it, remove
> trailing _t from enumerator names.
>
> Kurtosis
>
> 1. Make excess param a scoped enumeration instead of bool.
>
> Variance
>
> 1. Is var() really better than variance()?
>
>
> --
> Regards,
> Oleksandr Koval.
>
-----------------------------------------------------
This e-mail was sent via the Secure Web Server at the
University of Regina, Department of Computer Science
https://www.cs.uregina.ca/WebMail/
Received on 2021-05-03 09:57:48