You're proposing an interface with templates. Are you sure they can be
implemented efficiently inline for all architectures? Shouldn't they be out-of-
line and provide only int, long, and long long support via overloads?

I'm not really sure what you mean by that. There is no guarantee that "long long" or "int" are "efficient" either, and the standard generally doesn't care about this. The other functions in <numeric> such as saturating arithmetic or gcd are also templates, and accept any integer type. Whether that is "efficient" is a QoI issue. I'm just following conventions.
 
Paralleling the std::div function, should this return div_t? On some
architectures, the division instruction already provides the modulus, so it
would be useful to return both at the same time.

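To my knowledge, std::div is a historical relic. It mostly exists because C originally left the rounding of integer division with negative operands implementation-defined, whereas std::div was guaranteed to truncate. Since C99 and C++11, the / operator truncates as well, so there's really no point in using std::div anymore. Optimizing compilers will fuse separate division and remainder operations into one div instruction on targets where that instruction yields both results.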
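
To make that concrete, here is a minimal sketch of what "separate operations" means. The name quot_rem is hypothetical and not part of the proposal; whether a given compiler actually emits a single division for it is a QoI matter, though mainstream optimizers typically do so on x86-64.

```cpp
#include <utility>

// Hypothetical helper (not part of the proposal): quotient and remainder
// written as two independent expressions. No std::div is involved; on
// targets whose divide instruction yields both results, an optimizing
// compiler can typically serve both expressions from one instruction.
constexpr std::pair<int, int> quot_rem(int x, int y) {
    return {x / y, x % y}; // both truncate toward zero since C++11
}

static_assert(quot_rem(7, 2) == std::pair{3, 1});
static_assert(quot_rem(-7, 2) == std::pair{-3, -1});
```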
 
In your reference
implementation, every single std::div implementation is calculating both (x/y)
and (x%y), so it probably won't be too expensive to just return both quotient
and remainder anyway. In any case, given the precedent API that std::div
poses, it's probably better to follow it for anything whose name starts with
std::div and find another name for quotient-only results.

It's only superficially true that the reference implementation always computes the remainder. Many of the functions only need to check whether the remainder is nonzero, which is possibly cheaper. For example, if the optimizer knows that the dividend is smaller in magnitude than the divisor, the remainder is nonzero unless the dividend is zero, and no division is needed to find that out. Also, the remainder still needs to be adjusted to match the adjustments made to the quotient for the rounding mode. You're not getting the remainder entirely for free.

Anyhow, I think you're reading too much into std::div. Combined quotient/remainder functions seem unmotivated to me, and they're not entirely trivial to implement if you need high QoI instead of just computing the remainder in terms of the quotient. For a lot of these functions, the remainder isn't particularly interesting anyway, and you could just compute it in terms of the quotient if you needed it.
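
As an illustration of computing the remainder in terms of the quotient, here is a small sketch. div_ceil follows the naming used in this thread, rem_from_quotient is a hypothetical helper, and neither is taken from the reference implementation. Note that the naive formula can overflow near the limits of the type, which is one reason a high-QoI combined function is not trivial.

```cpp
// Sketch only: quotient rounded towards positive infinity.
constexpr int div_ceil(int x, int y) {
    const bool inexact = x % y != 0;
    const bool positive = (x < 0) == (y < 0); // exact quotient above zero
    // Truncation rounded a positive inexact quotient down, so step up by one.
    return x / y + (inexact && positive);
}

// Hypothetical helper: the remainder that matches a given quotient q.
// Caveat: q * y can overflow for extreme inputs (for example x == INT_MAX,
// y == 2 with div_ceil), so this is the low-effort version only.
constexpr int rem_from_quotient(int x, int y, int q) {
    return x - q * y;
}

static_assert(div_ceil(7, 2) == 4);
static_assert(rem_from_quotient(7, 2, div_ceil(7, 2)) == -1);
static_assert(div_ceil(-7, 2) == -3);
static_assert(rem_from_quotient(-7, 2, div_ceil(-7, 2)) == -1);
```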

Also note that virtually every C++ resource you will find online calls these functions div_floor and div_ceil and whatnot. I care much more about making C++ users happy than being consistent with old junk in the standard.
 
I'd also suggest this be proposed to the C standard. There's nothing here that
requires C++ or benefits only C++. C2y may get constexpr functions and, even if
they don't, it would be easy for implementations to provide the std:: versions
with constexpr with a suitable if consteval.

It would be cool to have this in C too, yeah. Maybe the design on the C++ side should be ironed out first though.
 
Finally, why so many options? Do we really need them? Why can't we settle for
the standard ones from the FP environment: nearest (odd/even tie breaking),
towards +Inf, towards -Inf, and towards zero?

At the very least, rounding away from zero should also be a thing. It's quite common that you'd want to scale down, say, survey data in the range [-n, n] onto a range of [-10, 10], and round either away from zero or towards zero so as not to introduce a bias towards one side.
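
A small sketch of that survey example, assuming scores in [-n, n] with n = 25. The names div_away_zero, scale_away_zero, and scale_to_zero are hypothetical and chosen for illustration only. The point is that both modes treat positive and negative scores symmetrically, which rounding towards negative or positive infinity would not.

```cpp
// Sketch only: quotient rounded away from zero.
constexpr int div_away_zero(int x, int y) {
    const bool inexact = x % y != 0;
    const int sign = ((x < 0) != (y < 0)) ? -1 : 1; // sign of exact quotient
    return x / y + (inexact ? sign : 0);
}

// Map a score from [-n, n] onto [-10, 10], rounding away from zero.
constexpr int scale_away_zero(int score, int n) {
    return div_away_zero(score * 10, n);
}

// Same mapping, rounding towards zero (what the / operator already does).
constexpr int scale_to_zero(int score, int n) {
    return score * 10 / n;
}

// 7/25 of the range is 2.8 points out of 10; -7/25 is -2.8.
static_assert(scale_away_zero(7, 25) == 3 && scale_away_zero(-7, 25) == -3);
static_assert(scale_to_zero(7, 25) == 2 && scale_to_zero(-7, 25) == -2);
```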

If you map test scores from, say, 0 to 100 onto an even number of grades with nearest rounding, it would also make little sense to break ties to even/odd rather than consistently towards negative infinity, positive infinity, or zero.

In general, any one of the rounding modes is somewhat useful in certain domains/use cases, and I don't see a benefit to shrinking the proposal. Why should we be limited to a handful of choices inspired by <math.h> design from 40 years ago? Whether the proposal has 5 functions or 25 makes little difference in the grand scheme of things. The wording is small, the implementation is only a few (relatively similar) lines each, and all of this is simple, numeric, header-only stuff.

The implementation cost for anything to do with floating-point is much greater, and so different standards and design practices apply.