Benchmarks

quantile-guard fits all quantiles jointly with non-crossing constraints. Independent fitters (sklearn, statsmodels) fit each quantile separately with no monotonicity enforcement. These benchmarks measure the practical difference on deliberately challenging data.

Test conditions: heavy-tailed heteroscedastic noise (Student-t, df=3), 10-20 features, up to 13 quantile levels. This is data designed to stress quantile estimators — in practice, your data may be gentler, but the guarantee still matters for production pipelines.

Crossing Rate

Fraction of test samples where at least one quantile prediction violates monotonicity:

n      p   quantiles  quantile-guard  sklearn  statsmodels
500    10  7          0%              11.0%    11.0%
500    10  13         0%              30.0%    30.0%
1,000  10  7          0%              6.0%     4.0%
1,000  10  13         0%              16.5%    15.0%
2,000  20  7          0%              4.5%     4.5%
2,000  20  13         0%              11.0%    11.0%
5,000  20  7          0%              0.0%     0.0%
5,000  20  13         0%              0.4%     0.4%

Crossings are worst when:

  • the data has heavy tails and heteroscedasticity
  • many closely-spaced quantiles are fitted (13 vs 7)
  • the sample is small relative to the number of features

At n=500 with 13 quantiles, 30% of test samples have crossings in sklearn/statsmodels. quantile-guard has zero by construction.
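
The crossing rate defined in this section can be computed from any matrix of quantile predictions. The sketch below is a minimal stdlib version, not the package's own metric: the function name `crossing_rate` is hypothetical, and it assumes each row is ordered by increasing quantile level and that ties do not count as crossings (the benchmark's exact tolerance is not stated here).

```python
def crossing_rate(preds):
    """Fraction of samples with at least one quantile crossing.

    preds: list of rows; each row holds one sample's predicted quantiles,
    ordered by increasing quantile level.
    """
    if not preds:
        raise ValueError("preds must contain at least one sample")
    crossed = sum(
        # A crossing is a lower-level quantile predicted above a higher-level one.
        any(lo > hi for lo, hi in zip(row, row[1:]))
        for row in preds
    )
    return crossed / len(preds)


preds = [
    [1.0, 2.0, 3.0],  # monotone
    [1.0, 0.8, 3.0],  # crossing: second quantile below the first
    [0.5, 0.5, 0.9],  # tie, not counted as a crossing (assumption)
    [2.0, 2.5, 2.4],  # crossing: third quantile below the second
]
rate = crossing_rate(preds)  # 2 of the 4 samples cross
```

Because a sample counts as soon as any adjacent pair is out of order, fitting more closely spaced levels gives each sample more chances to cross, which is consistent with the 13-quantile rows being worse than the 7-quantile rows.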

Pinball Loss

The joint non-crossing formulation closely matches the pinball loss of the independent fitters and improves on it in the hardest small-sample settings. The improvement is most visible at small n with many quantiles, where the non-crossing constraints can act as beneficial regularization. Lower is better.

n      p   quantiles  quantile-guard  sklearn  statsmodels
500    10  7          0.5148          0.5166   0.5166
500    10  13         0.5095          0.5240   0.5240
1,000  10  7          0.5082          0.5091   0.5084
1,000  10  13         0.5048          0.5071   0.5067
2,000  20  7          0.5604          0.5606   0.5606
2,000  20  13         0.5599          0.5611   0.5611
5,000  20  7          0.5925          0.5925   0.5925
5,000  20  13         0.5893          0.5896   0.5896

At n=500 with 13 quantiles, quantile-guard achieves 0.5095 versus 0.5240 for both independent fitters, a 2.8% relative reduction in pinball loss.
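
For reference, the pinball loss and the relative-improvement arithmetic can be sketched in a few lines. This is a generic stdlib illustration with hypothetical function names, not the package's evaluation API. Averaging over quantile levels is an assumption; how the benchmark aggregates across levels is not stated on this page.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss at level tau."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def mean_pinball_loss(y_true, preds_by_tau):
    """Average over levels (assumed aggregation); preds_by_tau maps tau -> predictions."""
    losses = [pinball_loss(y_true, p, tau) for tau, p in preds_by_tau.items()]
    return sum(losses) / len(losses)


def relative_improvement(baseline, candidate):
    """Fractional reduction in loss relative to the baseline."""
    return (baseline - candidate) / baseline


median_loss = pinball_loss([1.0, 2.0, 3.0], [2.0, 2.0, 2.0], tau=0.5)
gain = relative_improvement(0.5240, 0.5095)  # about 0.0277, i.e. the 2.8% quoted above
```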

The Speed Tradeoff

quantile-guard is slower in wall-clock time, by roughly 3x to 70x across the settings below. That is the cost of solving a single joint LP with non-crossing constraints rather than 7 or 13 separate small LPs.
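
To see where the cost comes from, it helps to write the joint problem down. The formulation below is the textbook form of non-crossing quantile regression, given as a sketch. The set $\mathcal{X}_c$ of points at which monotonicity is enforced is an assumption, since this page does not say which constraint set quantile-guard actually uses.

```latex
\begin{aligned}
\min_{\beta_1,\dots,\beta_K}\quad
  & \sum_{k=1}^{K}\sum_{i=1}^{n} \rho_{\tau_k}\!\left(y_i - x_i^\top \beta_k\right) \\
\text{s.t.}\quad
  & x^\top \beta_k \le x^\top \beta_{k+1},
    \qquad x \in \mathcal{X}_c,\quad k = 1,\dots,K-1,
\end{aligned}
\qquad
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr).
```

Without the constraints, the objective separates into K independent problems, which is what the per-quantile fitters solve. The constraints couple all K coefficient vectors into one LP. If they were imposed at every training row, for example, that would add n(K-1) inequalities: 60,000 at n=5,000 with 13 quantiles.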

n      p   quantiles  quantile-guard (sparse)  sklearn (sum of fits)  statsmodels (sum of fits)
500    10  7          0.8s                     0.1s                   0.3s
500    10  13         3.0s                     0.2s                   0.4s
1,000  10  7          2.7s                     0.3s                   0.1s
1,000  10  13         10.0s                    0.5s                   0.3s
2,000  20  7          18.5s                    1.3s                   3.4s
2,000  20  13         172.3s                   2.4s                   5.4s
5,000  20  7          124.9s                   7.0s                   4.2s
5,000  20  13         676.3s                   12.9s                  19.3s

What the extra time buys you

  • Zero crossings: no post-hoc fixes, no downstream pipeline failures
  • Joint estimation: all quantiles fitted together, sharing information
  • Equal or better pinball loss in these benchmarks: the non-crossing constraints can act as regularization
  • One fit call: inference, intervals, and diagnostics all come from the same model
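
For context, the usual post-hoc fix applied to independently fitted quantiles is rearrangement: sorting each sample's predictions so they are non-decreasing. The sketch below shows that generic technique with a hypothetical helper name. It is not necessarily what quantile-guard's own repair tool does.

```python
def rearrange(preds):
    """Sort each sample's predicted quantiles into non-decreasing order."""
    return [sorted(row) for row in preds]


fixed = rearrange([[1.0, 0.8, 3.0], [0.2, 0.4, 0.6]])
# The first row is reordered; the second was already monotone and is unchanged.
```

Rearrangement removes crossings from the predictions but leaves the fitted coefficients untouched, so the repaired curves no longer correspond to any single linear model per quantile.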

If you need only a single quantile with no crossing concerns, sklearn or statsmodels will be faster. quantile-guard's value is in the joint multi-quantile fit with guarantees — and the inference, calibration, and evaluation tools built around it.

Speeding things up

For smaller problems, use solver_backend='GLOP' for the simplex solver. For memory-constrained settings, use use_sparse=True.

Empirical Coverage

Coverage of the interval formed by the outermost quantile predictions (e.g., [0.05, 0.95] for 7 quantiles). Marginal coverage is broadly similar across methods on this synthetic benchmark, though the independent fits can deviate more when crossings are severe.

n      p   quantiles  Nominal  quantile-guard  sklearn  statsmodels
500    10  7          90%      93.0%           92.0%    92.0%
500    10  13         98%      97.0%           94.0%    94.0%
1,000  10  7          90%      89.5%           89.0%    88.5%
1,000  10  13         98%      98.0%           95.5%    95.5%
2,000  20  7          90%      89.0%           88.2%    88.2%
2,000  20  13         98%      95.8%           93.8%    93.8%
5,000  20  7          90%      89.1%           89.1%    89.1%
5,000  20  13         98%      97.6%           97.7%    97.7%
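
The coverage figures in this table follow from a simple count. Here is a minimal sketch with hypothetical function names, assuming the interval is closed at both ends:

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return inside / len(y_true)


def nominal_coverage(tau_low, tau_high):
    """Nominal coverage of the interval between two quantile levels."""
    return tau_high - tau_low


y = [1.0, 2.8, 3.0, 4.0]
lower = [0.0, 3.0, 2.0, 3.0]  # second sample: outer quantiles have crossed
upper = [2.0, 2.5, 4.0, 3.5]
cov = empirical_coverage(y, lower, upper)  # samples 1 and 3 are covered
nominal = nominal_coverage(0.05, 0.95)
```

The second sample shows how crossings hurt coverage: when the lower prediction exceeds the upper one, the interval is empty and cannot contain any observation.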

For better-calibrated intervals, use Conformalized Quantile Regression.
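
As a rough picture of what conformalization does, the sketch below implements the generic split-conformal CQR correction in plain Python. The function names are hypothetical and this is not the package's built-in CQR interface. It assumes a held-out calibration set with lower and upper quantile predictions already computed.

```python
import math


def cqr_offset(y_cal, lower_cal, upper_cal, alpha):
    """Conformal correction from a calibration set, for target coverage 1 - alpha."""
    # Conformity score: how far y falls outside the interval (negative if inside).
    scores = sorted(
        max(lo - y, y - hi) for y, lo, hi in zip(y_cal, lower_cal, upper_cal)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def cqr_interval(lower, upper, offset):
    """Widen (or shrink, if offset < 0) every interval by the same offset."""
    return [lo - offset for lo in lower], [hi + offset for hi in upper]


y_cal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
lower_cal = [2.0] * 9
upper_cal = [6.0] * 9
offset = cqr_offset(y_cal, lower_cal, upper_cal, alpha=0.25)
new_lower, new_upper = cqr_interval([2.0], [6.0], offset)
```

In this toy example the raw interval [2, 6] covers only 5 of the 9 calibration points, so the correction is positive and the interval is widened on both sides.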

Test Data

All benchmarks use synthetic data with:

  • Heavy-tailed noise: Student-t with 3 degrees of freedom
  • Heteroscedasticity: noise scale grows with feature values
  • Fixed seed: fully deterministic and reproducible

This is deliberately challenging data. On well-behaved Gaussian data, crossing rates would be lower — but the guarantee still matters for production pipelines.
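
To make these properties concrete, the sketch below generates data of this general kind using only the standard library. It is illustrative and not the benchmark's actual generator, which lives in the benchmarks/ scripts. The coefficient distribution, the uniform features, and the noise-scale formula `0.5 + x[0]` are all assumptions.

```python
import math
import random


def student_t(rng, df):
    """Draw one Student-t variate as Z / sqrt(chi2_df / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)


def make_data(n, p, seed=0, df=3):
    """Linear signal plus heavy-tailed noise whose scale depends on a feature."""
    rng = random.Random(seed)  # fixed seed makes the output reproducible
    coefs = [rng.gauss(0.0, 1.0) for _ in range(p)]  # hypothetical true coefficients
    X, y = [], []
    for _ in range(n):
        x = [rng.uniform(0.0, 1.0) for _ in range(p)]
        signal = sum(c * v for c, v in zip(coefs, x))
        scale = 0.5 + x[0]  # assumed form: noise scale grows with the first feature
        X.append(x)
        y.append(signal + scale * student_t(rng, df))
    return X, y


X, y = make_data(n=500, p=10, seed=0)
```

With 3 degrees of freedom the noise has finite variance but very heavy tails, so occasional extreme responses pull the outer quantile fits around. That is one reason small samples show the most crossings.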

Beyond Accuracy: The Full Toolkit

quantile-guard is more than a quantile regressor. The benchmark comparison above covers only the core fitting — the package also provides:

Capability                   What it adds
Standard errors              Analytical, kernel, cluster-robust, bootstrap
Conformal calibration        Built-in CQR with finite-sample coverage guarantees
Calibration diagnostics      Coverage by group, bin, feature; sharpness analysis
Evaluation metrics           Pinball loss, coverage, interval score, crossing rate
Crossing detection + repair  Diagnose and fix crossings from any model
Censored QR                  Right- and left-censored survival models
Regularization               L1, elastic net, SCAD, MCP

sklearn's QuantileRegressor does not provide this end-to-end workflow, and statsmodels' QuantReg covers only part of the inference story.

Reproducing These Results

pip install -e ".[benchmark]"
python benchmarks/run_linear_baselines.py
python benchmarks/report.py

Results are deterministic (fixed random seeds). Raw CSV output includes Python version, platform, and package version metadata.

See benchmarks/README.md for details.