Conversation

@cakedev0
Contributor

@cakedev0 cakedev0 commented Oct 21, 2025

Resolves #340

Specs:

  • no weights: ND-support, propagate NaNs, supported methods: "linear", "inverted_cdf", "averaged_inverted_cdf"
    • adding support for omitting NaNs shouldn't be too hard (if you assume NaNs are sorted at the end, which is what everyone does), but I don't think it's needed for now.
  • with weights: Only 1D or 2D support, propagate or omit NaNs, supported methods: "inverted_cdf", "averaged_inverted_cdf"
  • Delegates to all major supported backends when possible (numpy, cupy, torch and jax), except dask, as it was raising errors. (A usage sketch follows this list.)
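To make the specs concrete, here is a minimal usage sketch. The parameter names (weights, method, nan_policy, axis) are assumed from the description above and may differ from the final merged signature:

import numpy as np
import array_api_extra as xpx  # function under review; signature assumed from the specs above

a = np.asarray([[0.0, 1.0, 2.0, np.nan],
                [3.0, 4.0, 5.0, 6.0]])
w = np.asarray([[1.0, 2.0, 1.0, 1.0],
                [1.0, 1.0, 3.0, 1.0]])

# No weights: ND support, NaNs propagate.
xpx.quantile(a, 0.5, axis=1, method="linear")  # -> [nan, 4.5]

# With weights: only the two inverted-CDF methods; NaNs may be omitted.
xpx.quantile(a, 0.5, axis=1, weights=w,
             method="averaged_inverted_cdf", nan_policy="omit")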

Comparison with NumPy:

These specs match NumPy's quantile for nan_policy="propagate" and nanquantile for nan_policy="omit". The differences are:

  • using nan_policy="propagate"|"omit" (inspired by scipy) instead of two functions like numpy (quantile for nan_policy="propagate" and nanquantile for nan_policy="omit"). Why? Less code/doc duplication, and clearer behavior IMO.
  • weighted case:
    • support for "averaged_inverted_cdf" method (only "inverted_cdf" method is supported by numpy): we need this in scikit-learn.
    • only up to 2D: we don't need more for now.
  • non-weighted case:
    • fewer methods. Why? We don't need more: I've found only one call in scikit-learn that passes a method, and it's a deprecated one.
    • no support for nan_policy="omit": there are a few calls to nanpercentile in scikit-learn. Would be easy to implement, following what has been done in scipy.
    • implementation: we sort instead of relying on partitioning like numpy does. A partitioning-based implementation would be quite a bit more complex and would not benefit performance in most cases, as xpx.partition currently falls back to sorting when not delegating. This could be worked on later if needed; a sketch of the sort-based approach follows this list.
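For reference, a minimal sketch of the sort-based "linear" method (Hyndman & Fan's definition 7) for a scalar q, ignoring NaN handling and broadcasting; this is illustrative, not the PR's actual code:

import numpy as np

def linear_quantile_sorted(a, q, axis=-1):
    # Sort the whole axis; a partition would only need the two neighbours
    # of the virtual index, but xpx.partition sorts anyway when not delegating.
    a = np.sort(a, axis=axis)
    n = a.shape[axis]
    h = q * (n - 1)              # virtual (fractional) index
    lo = int(np.floor(h))
    hi = int(np.ceil(h))
    lo_val = np.take(a, lo, axis=axis)
    hi_val = np.take(a, hi, axis=axis)
    return lo_val + (h - lo) * (hi_val - lo_val)

linear_quantile_sorted(np.array([3.0, 4.0, 5.0, 6.0]), 0.5)  # -> 4.5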

Comparison with SciPy:

The main difference is the broadcasting/reduction behavior. We aligned with NumPy on this one.

Also, SciPy doesn't support weights.

Implementation for the non-weighted case is more or less copy-pasted from SciPy, except that I've only kept support for 3 methods.

Future use of xpx.quantile in scikit-learn

These specs should be enough to replace sklearn.utils.stats._weighted_percentile and most uses of np.quantile/np.percentile when rewriting numpy-based functions to support the Array API standard.

Note that the implementation for the weighted case differs a bit from sklearn's _weighted_percentile: it's mostly the same approach (sort the weights by the order of a and compute a cumulative sum), but they differ in the way edge cases are handled (null weights mostly). I believe my implementation to be easier to read and to get right, and equivalent in terms of performance (dominated by argsort for big inputs).
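For illustration, a 1D sketch of that weighted "inverted_cdf" approach (no NaN or zero-weight handling; not the PR's actual code):

import numpy as np

def weighted_inverted_cdf(a, weights, q):
    order = np.argsort(a)                 # dominant cost for big inputs
    a, weights = a[order], weights[order]
    cw = np.cumsum(weights)               # steps of the weighted empirical CDF
    # Smallest value whose cumulative weight reaches q * total weight.
    idx = np.searchsorted(cw, q * cw[-1], side="left")
    return a[idx]

weighted_inverted_cdf(np.array([3., 4., 5., 6.]),
                      np.array([1., 1., 3., 1.]), 0.5)  # -> 5.0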


Still TODO:

  • decide whether to support dask. I don't think we should: it's very slow and doesn't match the purpose of dask as mentioned here. Supporting dask would also slow down the CI quite a bit.
  • read scikit-learn's tests for _weighted_percentile to see if there is something I missed in my tests

@cakedev0 cakedev0 marked this pull request as draft October 21, 2025 10:05
@lucascolley lucascolley changed the title from "FEA: Add quantile function - method="linear", no weights" to "ENH: add quantile function - method="linear", no weights" Oct 21, 2025
@lucascolley lucascolley added the "enhancement (New feature or request)" and "new function" labels Oct 21, 2025
@lucascolley lucascolley mentioned this pull request Oct 21, 2025
@lucyleeow
Copy link

From scipy/scipy#23832

There is a slight difference between NumPy gufunc and scipy.stats rules that has to do with prepending 1s to the shapes when the arguments have different dimensionality. Specifically, in scipy.stats, axis identifies the core dimension after 1s are prepended to match the dimensionalities.

I am assuming all delegate packages do what numpy does, in terms of broadcasting? If that is the case we should probably follow unless we think scipy has better rules?

@cakedev0 cakedev0 changed the title from "ENH: add quantile function - method="linear", no weights" to "ENH: add quantile function with weights support" Oct 23, 2025
@cakedev0
Copy link
Contributor Author

cakedev0 commented Oct 23, 2025

This PR is now ~95% finished, so marking it ready for review. I've updated the PR description accordingly.

@cakedev0 cakedev0 marked this pull request as ready for review October 23, 2025 15:19
    raise ValueError(msg)
else:
    if method not in {"inverted_cdf", "averaged_inverted_cdf"}:
        msg = f"`method` '{method}' not supported with weights."


only methods x/y support weights?


methods = {"linear", "inverted_cdf", "averaged_inverted_cdf"}
if method not in methods:
    msg = f"`method` must be one of {methods}"


Sort methods to get a deterministic output?

Contributor Author


I think it's deterministic already. But do you mean declaring methods in the sorted order?

Like this: methods = {"averaged_inverted_cdf", "inverted_cdf", "linear"}


I thought this was not deterministic - but maybe this was in older python versions or between different OS, or maybe I just misremembered? In any case - sorry about the noise.

    raise ValueError(msg)
nan_policies = {"propagate", "omit"}
if nan_policy not in nan_policies:
    msg = f"`nan_policy` must be one of {nan_policies}"


ditto

@lucascolley
Member

@cakedev0 would you be able to fill out the PR description a little more to justify the API decisions, primarily explaining how the implementation compares to what is available in NumPy and SciPy respectively?

Is there a link to a PR which would use this in sklearn, or are we not at that stage yet?

@cakedev0
Contributor Author

Added some comparisons with numpy/scipy in the PR description.

Is there a link to a PR which would use this in sklearn, or are we not at that stage yet?

We are not at this stage yet, but I would be happy to open an "experiment PR" where I replace most calls to np.percentile/np.quantile/_weighted_percentile with calls to xpx.quantile to check that the tests pass. This would help a lot with building confidence in this PR (in terms of API decisions and implementation correctness).

@lucascolley
Member

Thanks, yeah, that sounds like a good idea.

@cakedev0
Contributor Author

Updates from working on the "experiment PR" (you can see the diff here: cakedev0/scikit-learn#3):

  • all tests pass 🎉 (including tests for _weighted_percentile)
  • supporting nan_policy="omit" is important/needed. It unblocks 3 preprocessors: QuantileTransformer, RobustScaler, SplineTransformer, which is more or less half of the scikit-learn items needing an xpx.quantile on the list Make more of the "tools" of scikit-learn Array API compatible scikit-learn/scikit-learn#26024 - so I'll work on adding it.
  • supporting more methods would be a bit useful for QuantileTransformer (but I think it's fine not to support all the methods for all the backends...). Question: do we need to raise an error for a method that numpy implements but xpx doesn't, when the backend is numpy?
  • it would be convenient to have a "default" method that resolves to "averaged_inverted_cdf" in the weighted case and "linear" in the non-weighted case. Unsure whether this is good practice in terms of interface, though?

@cakedev0
Contributor Author

(putting this in draft mode while I'm working on supporting nan_policy="omit" for the non-weighted case)

@cakedev0 cakedev0 marked this pull request as draft October 27, 2025 19:43
@lucyleeow

would you be able to fill out the PR description a little more to justify API decisions,

FYI @cakedev0, there is currently a discussion about nan API in numpy: https://mail.python.org/archives/list/[email protected]/thread/P4EEL5P2PNDASWLI24CPHOW2KG2LZ3YY/

supporting more methods would be a bit useful for QuantileTransformer

Can you expand?

@cakedev0
Contributor Author

cakedev0 commented Oct 28, 2025

supporting more methods would be a bit useful for QuantileTransformer

Sorry, not the QuantileTransformer but the KBinsDiscretizer, it supports various quantile methods: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/preprocessing/_discretization.py#L59-L66

(but I guess it's fine not to support all the methods in the array API version)

@mdhaber
Contributor

mdhaber commented Nov 4, 2025

Perhaps I can't be entirely objective, but I think there's also an objective argument against a split between xpx.quantile and stats.quantile.

It sounds like the pain points are the broadcasting behavior and weights. Let's see if I can address those.

Main difference is the broadcasting/reduction behavior. We aligned on numpy for this one.

NumPy is self-inconsistent on this one. NEP 5 specifies what generalized universal functions should look like. I don't know the history - maybe quantile/percentile existed first - but they could (in principle) follow the NEP and don't. That behavior makes it impossible to take quantiles at different probabilities for different slices, which has real-world uses (e.g. BCa bootstrap). The behavior of stats.quantile permits that and permits the Cartesian product calculation.

Now, this is not to say that either of the NumPy precedents should be the final answer for the array API or array-api-extra. I think we should do what's right for the ecosystem. Consistency with other array API functions (does anything else "broadcast" by taking a Cartesian product? Imagine if this were the standard for other array operations like addition!) and functionality (do we really want to preclude the possibility of different probabilities corresponding with each slice?) needs to be considered. If sklearn only needs the Cartesian product behavior and doesn't like the standard broadcasting rules, a lightweight wrapper can easily translate.
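To make the two behaviors concrete, a small shape comparison (assuming scipy >= 1.15 for stats.quantile, and the broadcasting rules as described above):

import numpy as np
from scipy import stats  # stats.quantile needs scipy >= 1.15

x = np.random.default_rng(0).random((3, 10))  # 3 slices of 10 samples
p = np.array([0.25, 0.5, 0.75])

# NumPy: Cartesian product - every probability for every slice.
np.quantile(x, p, axis=-1).shape                    # (3, 3)

# SciPy: p broadcasts against x, so each slice can get its own probability.
stats.quantile(x, p[:, np.newaxis], axis=-1).shape  # (3,)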

Also, SciPy doesn't support weights.

weights were requested in scipy/scipy#22894. I was willing to help with it, but there was no response to scipy/scipy#22794 (comment). A PR for weights appeared, but it didn't address the concerns in scipy/scipy#22894. I didn't say anything to block it, but apparently nobody else is interested, so neither has progressed. weights are not very hard to implement; it's the theoretical/interface question that is tricky.

If I can get some feedback about scipy/scipy#22644 - which is about quantile and predates the weights request - and get that merged, I would be more proactive about adding weights support to SciPy's quantile. That PR actually has a similar interface question about how to accept method-specific options, which we might have if weights were not supported by the continuous methods.

@lorentzenchr

After getting weights into numpy quantile/percentile, I was too exhausted to continue pushing it elsewhere. At least now there is one reference, numpy. I would love to see it adopted elsewhere. Quantiles are just very important in many fields.

@mdhaber
Contributor

mdhaber commented Nov 5, 2025

I drafted it for stats.quantile last night, and I studied existing uses of weights in scipy.stats. I'll submit a PR today or tomorrow.

The same code works for all H&F methods. I was a little concerned about non-integer weights before, but I don't think there's a real problem. For integer weights, it does the same thing as repeated samples, so we can document that these are frequency weights1. It's just that for most methods, the interpolation actually does depend on the total amount of data / total weight.

import numpy as np
import matplotlib.pyplot as plt

# With method='linear' the interpolation depends on the sample size, so
# repeating every sample (i.e. a frequency weight of 2 or 3) changes the curve.
p = np.linspace(0, 1, 300)
res = np.quantile([0, 1], p, method='linear')
res2 = np.quantile([0, 0, 1, 1], p, method='linear')
res3 = np.quantile([0, 0, 0, 1, 1, 1], p, method='linear')
plt.plot(p, res, p, res2, p, res3)
plt.show()

[figure: the three quantile curves; the interpolation ramp steepens as the samples are repeated]

Footnotes

  1. If they can be interpreted as some other kind of weight, too, great. But almost all uses of weights in stats turn out to be frequency weights. ↩

@mdhaber
Contributor

mdhaber commented Nov 6, 2025

PR is up at scipy/scipy#23941. Here's a wrapper that hamstrings stats.quantile to get NumPy's Cartesian product behavior:

import numpy as np
from scipy import stats

def quantile_cartesian_product(x, p, method='linear', axis=0):
    p_shape = p.shape
    p = np.ravel(p)
    x = np.moveaxis(x, axis, -1)
    res = stats.quantile(x, p, method=method, axis=-1, keepdims=True)
    if len(p_shape):
        res = np.moveaxis(res, -1, 0)
        res = np.reshape(res, p_shape + res.shape[1:])
    else:
        res = np.squeeze(res, axis=-1)
    return res

# test
rng = np.random.default_rng(234892398458345982)
x = rng.random((6, 7, 8))

for axis in [-1, 0, 1]:
    for p_shape in [(), (1,), (2,), (2, 3)]:
        p = rng.random(p_shape)
        kwargs = dict(axis=axis, method='weibull')  # use the loop's axis
        res = quantile_cartesian_product(x, p, **kwargs)
        ref = np.quantile(x, p, **kwargs)
        np.testing.assert_allclose(res, ref)

@lucascolley
Member

where does this stand now that scipy/scipy#23941 has merged?

@cakedev0
Contributor Author

cakedev0 commented Dec 8, 2025

Personally, I don't really agree with some decisions that were made in scipy.stats but I was unable to make comments on the PR due to my wrist issues.

Also, I'm busier lately and I don't want to spend time porting the quantile function from scipy here: it's really big now and relies on a lot of internal scipy helpers, so it would be a lot of work.

Being very new to the ecosystem, I'm not sure my opinion is well informed; still, here are my disagreements with the choices that were made in scipy:

About the broadcasting behavior: I agree that the behavior of scipy.stats.quantile is indeed the most flexible.
But I don't think it's natural. In the doc, it's written that it behaves the same as other functions in scipy.stats but I disagree. Other functions I could find with broadcasting behavior in scipy.stats all work this way:
f(x, y, axis) -> broadcast(x, y); apply f along axis (and reduce it away if keepdims=False).
In those, x and y are of the same nature, so it's natural to broadcast them together, and then axis is reduced away. It's equivalent to the behavior you'd get from f(x, y, axis): return (x + y).sum(axis=axis). Very natural.
For the quantile, first it's not very natural to broadcast "data" with "parameters", and secondly the reduction behavior is much more complex. One thing I really don't like is that the output dimension is not determined by the input dimensions, for instance: both quantile([0, 1], [0.5]) and quantile([0, 1], 0.5) have dimension 0 but quantile([0, 1], [0.5, 0.9]) has dimension 1.

So for me, it's not clear which option is the best. IMO, here are the pros and cons of the scipy approach:

  • +++ more powerful: it unlocks things that would otherwise require a Python loop. But how much would that loop cost? It's generally very OK to loop over, say, 100 features.
  • -- more complex behavior for users, and a change from what they are used to for this function (all other array libraries align with numpy, I think).
  • - for array-api-extra: more complex implementation, more complex delegation code, and fewer cases where delegation is possible.

About the weights: personally, I don't understand the interest of frequency weights when they have a valid interpretation only for integers. Those should be called counts, not weights; they are just an optimization, not a feature, because f(x, weights=counts) returns the same result as f(xp.repeat(x, counts)), just computed more efficiently.
So I don't think supporting weights as frequency weights for all methods is a good idea, it would be better to implement a counts argument if the optimization is needed.
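A quick numerical illustration of that equivalence, assuming NumPy >= 2.0 (which accepts weights together with method="inverted_cdf"):

import numpy as np  # weighted quantile needs NumPy >= 2.0

x = np.array([0., 1., 2., 3.])
counts = np.array([1, 3, 2, 1])

a = np.quantile(x, 0.5, weights=counts, method="inverted_cdf")
b = np.quantile(np.repeat(x, counts), 0.5, method="inverted_cdf")
assert a == b  # frequency weights behave exactly like repeated samples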

For methods "inverted_cdf" and "averaged_inverted_cdf", there are clear interpretations for real-valued weights: the one in the numpy doc, or the link with the argmin of the weighted pinball loss (which is why scikit-learn needs weighted quantile in many cases I think).

@lucascolley
Member

I don't have a horse in the race w.r.t. the nicest user interface, so I will leave that discussion to those already involved.

What I will say is that from an ecosystem perspective, now that SciPy has committed to the public API, I would in principle be happy with a situation where we:

  • offer a certain API in array-api-extra which has some limitations but is more lightweight and familiar for users of preceding tools
  • direct users to SciPy for use-cases not covered by that implementation, or for situations where things like the SciPy broadcasting may result in code that is easier to understand

The downside of having two functions with the same name providing different semantics is real. But if the consensus from the technical discussion is that there are real tradeoffs such that each set of semantics has situations in which it is preferable, that might be acceptable.

@mdhaber
Contributor

mdhaber commented Dec 8, 2025

Naturalness is subjective, so we valued consistency with the rest of scipy.stats1. Unfortunately, this meant lack of consistency with np.quantile because np.quantile was an anomaly and didn't respect decisions made by NumPy previously (NEP 5).

One thing I really don't like is that the output dimension is not determined by the input dimensions,

Of course they are. But like every other stats function, they also depend on keepdims, which controls whether the axis is reduced away. The default of keepdims is False in all other functions because it can always be False. We tried to keep in line with that by making the default False where possible. (It has to be True if the second argument has length > 1 along axis.)

Now, I don't necessarily like the existence of keepdims as a parameter throughout stats. But it is what it is.

But I don't think it's natural...

Every other independent two-sample statistic in SciPy (e.g. ttest_ind, mannwhitneyu) broadcasts the arrays exactly as quantile does. You can tell because they use the same function to do the broadcasting. So quantile does f(x, y, axis) -> broadcast(x, y); apply f along axis (and reduce it away if keepdims=False) just as described.

If you mean "they don't broadcast the arrays exactly like np.broadcast_arrays", then you're right. We make one modification to the broadcast_arrays rule because we have to generalize to the case in which the length along axis is not equal.

After that, the only difference is that the length of the output core dimension of quantile is that of the second argument, and the length of the output core dimension of other statistics is 1.

About the weights: personally, I don't understand the interest of frequency weights when you have a valid interpretation only for integers.

About a dozen functions in SciPy accept frequency weights via the weights parameter, so we stuck with that name instead of inventing a new one like counts. And also because sometimes the implementations of frequency and other kinds of weights are exactly the same: for inverted_cdf and averaged_inverted_cdf, the weighted statistic is calculated the same way as in NumPy and as proposed here. So the weights are not limited to integers, and you can interpret them however you wish. There is no limitation here; we just added the opportunity to use frequency weights with other methods.

Footnotes

  1. which I find quite natural, thinking about arrays as arrays. I do not introduce a notion of some arrays being "data" and others being "parameters"; from that perspective, I can see how array broadcasting in general would be unintuitive. ↩
