Validation & Statistical Tests
How to determine whether your backtest results are real or noise.
The Problem
A Sharpe ratio of 0.8 looks great, but is it statistically significant? Test enough parameter configurations and one of them will look good purely by chance. Quant101 provides a validation toolkit to answer this question.
Walk-Forward Analysis
Split the data into rolling (or anchored) train/test windows, separated by an embargo gap that keeps information in the training window from leaking into the test window:
from validation.walk_forward import walk_forward_split
folds = walk_forward_split(
    dates,
    train_days=126,    # ~6 months training
    test_days=63,      # ~3 months testing
    embargo_days=5,    # 5-day gap to prevent leakage
    mode="rolling",    # or "anchored"
)
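To make the windowing concrete, here is a minimal sketch of how such a split could be computed on integer positions. The helper `rolling_walk_forward_indices` is hypothetical and is not the code behind `walk_forward_split`; in particular, stepping forward by one test window per fold is an assumption.

```python
from typing import List, Tuple


def rolling_walk_forward_indices(
    n_obs: int,
    train_days: int,
    test_days: int,
    embargo_days: int = 0,
    mode: str = "rolling",
) -> List[Tuple[range, range]]:
    """Hypothetical helper: return (train, test) index ranges for each fold."""
    if mode not in ("rolling", "anchored"):
        raise ValueError("mode must be 'rolling' or 'anchored'")
    folds = []
    # The first test window starts right after the first training window plus the gap.
    test_start = train_days + embargo_days
    while test_start + test_days <= n_obs:
        train_end = test_start - embargo_days  # exclusive end of training data
        # Anchored mode always trains from the first observation;
        # rolling mode keeps a fixed-length training window.
        train_start = 0 if mode == "anchored" else train_end - train_days
        folds.append(
            (range(train_start, train_end), range(test_start, test_start + test_days))
        )
        test_start += test_days  # assumption: advance by one test window per fold
    return folds


folds = rolling_walk_forward_indices(504, train_days=126, test_days=63, embargo_days=5)
for train, test in folds:
    print(f"train [{train.start}, {train.stop})  test [{test.start}, {test.stop})")
```

Each fold leaves `embargo_days` unused observations between the end of training and the start of testing, so labels or indicators that span several days cannot straddle the boundary.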
Run the full pipeline per fold:
from portfolio.walk_forward_runner import run_walk_forward
wf = run_walk_forward(ohlcv, config=config)
print(f"Mean OOS Sharpe: {wf['mean_oos_sharpe']:.3f}")
print(f"Sharpe decay: {wf['sharpe_decay']:.3f}")
# Decay > 0 → IS Sharpe > OOS Sharpe (overfitting signal)
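The sign convention in the comment above can be illustrated with a small sketch. It assumes that decay is the mean in-sample Sharpe minus the mean out-of-sample Sharpe across folds, and that Sharpe is annualized with 252 periods per year; the helpers `annualized_sharpe` and `sharpe_decay` are hypothetical, and `run_walk_forward` may compute the figure differently.

```python
import math
import statistics
from typing import Sequence


def annualized_sharpe(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio with a zero risk-free rate (assumption)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def sharpe_decay(
    is_folds: Sequence[Sequence[float]],
    oos_folds: Sequence[Sequence[float]],
) -> float:
    """Mean in-sample Sharpe minus mean out-of-sample Sharpe across folds."""
    mean_is = statistics.mean([annualized_sharpe(f) for f in is_folds])
    mean_oos = statistics.mean([annualized_sharpe(f) for f in oos_folds])
    return mean_is - mean_oos


# Toy data: one fold that looks strong in-sample and flat out-of-sample.
in_sample = [[0.01, 0.02, 0.015, 0.005]]
out_of_sample = [[0.01, -0.01, 0.005, -0.005]]
print(f"Sharpe decay: {sharpe_decay(in_sample, out_of_sample):.3f}")
```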
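Bootstrap Confidence Intervals
The circular block bootstrap resamples contiguous blocks of returns, wrapping around the end of the series, which preserves the short-range autocorrelation structure of the returns: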
from validation.statistical_tests import bootstrap_sharpe_ci
ci_low, ci_high = bootstrap_sharpe_ci(
    returns,
    n_bootstrap=10000,
    confidence=0.95,
)
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
Warning
If the 95% CI includes zero, you cannot reject the null hypothesis that the true Sharpe ratio is zero. The result is indistinguishable from noise.
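For readers who want to see the resampling step itself, the following is a minimal NumPy sketch of a circular block bootstrap with a percentile interval. The function name, the `block_size` and `seed` parameters, and the use of a per-period (not annualized) Sharpe ratio are all assumptions made for illustration; `bootstrap_sharpe_ci` may differ in each of these.

```python
import numpy as np


def circular_block_bootstrap_sharpe_ci(
    returns,
    block_size: int = 20,      # assumption: fixed block length
    n_bootstrap: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
):
    """Percentile CI for the per-period Sharpe ratio (illustrative sketch)."""
    r = np.asarray(returns, dtype=float)
    n = r.size
    if n < 2:
        raise ValueError("need at least two observations")
    rng = np.random.default_rng(seed)
    n_blocks = -(-n // block_size)  # ceiling division
    offsets = np.arange(block_size)
    sharpes = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Draw random block starts; the modulo wraps blocks around the end
        # of the series, which is what makes the bootstrap "circular".
        starts = rng.integers(0, n, size=n_blocks)
        idx = (starts[:, None] + offsets[None, :]) % n
        sample = r[idx.ravel()[:n]]  # trim to the original length
        sd = sample.std(ddof=1)
        sharpes[b] = sample.mean() / sd if sd > 0 else 0.0
    alpha = 1.0 - confidence
    lo, hi = np.quantile(sharpes, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Simulated daily returns, for illustration only.
demo = np.random.default_rng(42).normal(0.0005, 0.01, size=750)
ci_low, ci_high = circular_block_bootstrap_sharpe_ci(demo)
print(f"95% CI (per-period Sharpe): [{ci_low:.3f}, {ci_high:.3f}]")
```

The block length trades bias against variance: blocks that are too short destroy the autocorrelation the method is meant to keep, while blocks that are too long leave few distinct resamples.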
Probabilistic Sharpe Ratio (PSR)
From Bailey & López de Prado (2012). PSR is the probability that the true Sharpe ratio exceeds a benchmark, given the sample length and the skewness and kurtosis of the returns:
from validation.statistical_tests import probabilistic_sharpe_ratio
psr = probabilistic_sharpe_ratio(returns, sr_benchmark=0)
print(f"PSR: {psr:.1%}") # e.g., 91.8%
A PSR above 95% is required before treating the Sharpe ratio as significant. Unlike a simple t-test on the Sharpe ratio, PSR adjusts for non-normal returns.
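As a reference point, the sketch below implements the published PSR formula, PSR = Φ((SR − SR*) · √(n − 1) / √(1 − γ₃·SR + (γ₄ − 1)/4 · SR²)), where SR is the per-period Sharpe ratio, γ₃ the skewness, and γ₄ the (non-excess) kurtosis. The function `psr_sketch` is hypothetical and uses population moments for simplicity; it is not the code behind `probabilistic_sharpe_ratio`.

```python
import math
from statistics import NormalDist
from typing import Sequence


def psr_sketch(returns: Sequence[float], sr_benchmark: float = 0.0) -> float:
    """Probabilistic Sharpe Ratio on per-period returns (illustrative sketch)."""
    n = len(returns)
    mean = sum(returns) / n
    dev = [x - mean for x in returns]
    m2 = sum(d ** 2 for d in dev) / n
    m3 = sum(d ** 3 for d in dev) / n
    m4 = sum(d ** 4 for d in dev) / n
    sd = math.sqrt(m2)
    skew = m3 / sd ** 3
    kurt = m4 / sd ** 4          # non-excess kurtosis: 3 for a normal distribution
    sr = mean / sd               # per-period Sharpe, same units as sr_benchmark
    # Standard error term that widens for negative skew and fat tails.
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr_benchmark) * math.sqrt(n - 1) / denom
    return NormalDist().cdf(z)


# Toy series alternating +2% and -1%, for illustration only.
toy = [0.02, -0.01] * 50
print(f"PSR vs 0:   {psr_sketch(toy):.1%}")
print(f"PSR vs 0.2: {psr_sketch(toy, sr_benchmark=0.2):.1%}")
```

Note that the benchmark must be expressed in the same per-period units as the estimated Sharpe ratio; an annualized benchmark has to be scaled down before being passed in.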
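Deflated Sharpe Ratio (DSR)
From Bailey & López de Prado (2014). DSR adjusts PSR for multiple trials. If you tested 16 configurations, the best Sharpe ratio among them is expected to be positive by chance alone, and that expected maximum grows with the number of trials. DSR raises the PSR benchmark to correct for this: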
from validation.statistical_tests import deflated_sharpe_ratio
dsr = deflated_sharpe_ratio(returns, n_trials=16)
print(f"DSR: {dsr:.1%}")
Our Result
For the BBIBOLL factor across 16 configurations:
- Best config PSR = 91.8%
- DSR = 34.2% (after adjusting for 16 trials)
- Verdict: not statistically significant
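The size of the multiple-trials penalty can be sketched directly. In the 2014 paper, the expected maximum Sharpe ratio under the null is approximated as √V · ((1 − γ)·Φ⁻¹(1 − 1/N) + γ·Φ⁻¹(1 − 1/(N·e))), where N is the number of trials, V is the variance of the Sharpe estimates across trials, and γ is the Euler-Mascheroni constant. DSR is then PSR evaluated against that benchmark. The helper below is hypothetical, and `sr_variance` is an explicit input here because this page does not say how `deflated_sharpe_ratio` estimates it.

```python
import math
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649015329


def expected_max_sharpe(n_trials: int, sr_variance: float) -> float:
    """Expected maximum Sharpe across n_trials when the true Sharpe is zero."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    inv_cdf = NormalDist().inv_cdf
    g = EULER_MASCHERONI
    z = (1 - g) * inv_cdf(1 - 1 / n_trials) + g * inv_cdf(1 - 1 / (n_trials * math.e))
    return math.sqrt(sr_variance) * z


# Multiplier on the cross-trial standard deviation of Sharpe estimates.
for n in (2, 16, 100):
    print(f"n_trials={n:>3}: benchmark = {expected_max_sharpe(n, 1.0):.2f} std devs")
```

With 16 trials the benchmark sits at roughly 1.8 standard deviations of the cross-trial Sharpe estimates, which is why a PSR that looks respectable against zero can fall sharply once it is deflated.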
Multiple Testing Corrections
When testing many hypotheses simultaneously, adjust p-values:
from validation.multiple_testing import apply_all_corrections
p_values = [0.03, 0.05, 0.12, 0.01, ...] # One per config
results = apply_all_corrections(p_values)
# results["bonferroni"] ← Most conservative (FWER control)
# results["holm"] ← Step-down (FWER, more powerful)
# results["benjamini_hochberg"] ← FDR control (most liberal)
| Method | Controls | Use When |
|---|---|---|
| Bonferroni | Family-wise error rate | Few tests, and any false positive is costly (most conservative) |
| Holm-Bonferroni | Family-wise error rate | Same guarantee as Bonferroni, but the step-down procedure is more powerful |
| Benjamini-Hochberg | False discovery rate | Many tests, accept some false positives |
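The three corrections in the table are short enough to write out. The sketch below returns adjusted p-values under the same dictionary keys used above; whether `apply_all_corrections` returns adjusted p-values or reject/accept flags is not stated here, so treat `adjust_p_values` as a hypothetical stand-in with illustrative inputs.

```python
from typing import Dict, List, Sequence


def adjust_p_values(p_values: Sequence[float]) -> Dict[str, List[float]]:
    """Bonferroni, Holm, and Benjamini-Hochberg adjusted p-values (sketch)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p

    # Bonferroni: multiply every p-value by the number of tests.
    bonferroni = [min(1.0, p * m) for p in p_values]

    # Holm: step down from the smallest p-value, enforcing monotonicity.
    holm = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[i])
        holm[i] = min(1.0, running_max)

    # Benjamini-Hochberg: step up from the largest p-value.
    bh = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, p_values[i] * m / (rank + 1))
        bh[i] = running_min

    return {"bonferroni": bonferroni, "holm": holm, "benjamini_hochberg": bh}


adjusted = adjust_p_values([0.03, 0.05, 0.12, 0.01])  # illustrative values
for method, values in adjusted.items():
    print(method, [round(v, 4) for v in values])
```

A configuration is kept under a given method when its adjusted p-value is below the chosen significance level, for example 0.05.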
The Validation Gauntlet
The recommended validation sequence for any new factor:
1. Walk-forward IS/OOS Sharpe → Is there signal OOS?
2. Bootstrap Sharpe CI → Does CI exclude zero?
3. PSR → P(true Sharpe > 0)?
4. DSR (with n_trials) → After correction for snooping?
5. Multiple testing on config sweep → Any config survives BH?
6. Sub-period stability → Consistent across half-years?
7. Cost-adjusted Sharpe → Profitable after 5 bps?
If a factor survives all seven, it's worth keeping. Most won't. That's the point: fail cheaply and learn from the failure.
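Steps 6 and 7 have no dedicated helper shown on this page, so here is a rough sketch of what they could look like. It assumes daily returns, 126-day half-years, and a linear cost model in which each unit of turnover costs `cost_bps` basis points; all three function names are hypothetical.

```python
import math
import statistics
from typing import List, Sequence


def sharpe(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio with a zero risk-free rate (assumption)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def net_of_costs(
    gross_returns: Sequence[float],
    turnover: Sequence[float],
    cost_bps: float = 5.0,
) -> List[float]:
    """Subtract a linear trading cost: turnover times cost in basis points."""
    cost = cost_bps / 10_000.0
    return [r - t * cost for r, t in zip(gross_returns, turnover)]


def sub_period_sharpes(returns: Sequence[float], period_len: int = 126) -> List[float]:
    """Sharpe ratio of each consecutive, non-overlapping sub-period."""
    last_start = len(returns) - period_len
    return [
        sharpe(returns[i:i + period_len])
        for i in range(0, last_start + 1, period_len)
    ]


# Toy data: one year of alternating returns with full daily turnover.
gross = [0.002, -0.001] * 126
turnover = [1.0] * len(gross)
net = net_of_costs(gross, turnover, cost_bps=5.0)
print(f"Gross Sharpe: {sharpe(gross):.2f}, net Sharpe: {sharpe(net):.2f}")
print("Half-year Sharpes:", [round(s, 2) for s in sub_period_sharpes(gross)])
```

A factor whose Sharpe ratio changes sign between half-years, or disappears once costs are subtracted, fails the corresponding step.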
See notebooks/validation.ipynb for the full 16-config gauntlet.