“Inside the Engine: How Our Machine Learning System Delivers Weekly Probabilities with Real-World Confidence”

From Red/Green Signals to Real Confidence: How to Think Like a Probabilistic Investor

Table of Contents

  1. Modeling Objective: Estimating Short-Term Probabilities
  2. Model Architecture: From Features to Forecasts
  3. Model Selection and Ensemble Weighting Logic
  4. Time-Aware Validation: Real-World Reliability
  5. Performance Metrics and Probability Bands
  6. Reliability, Entropy, and Confidence
  7. Production Model Workflow and Live Updates
  8. Summary and Closing Insights
  9. Glossary
  10. References

 

1 Modeling Objective: Estimating Short-Term Probabilities

In applied finance, the difference between a forecast and a usable edge lies in the clarity of its objective. (Throughout this paper, "execution" refers not to literal trade orders but to actionable decisions guided by forecast probabilities.) Our system is built to answer one precise question:

What is the probability that this stock will generate a positive return over the next five trading days?

This question is neither vague nor cosmetic. It is the core modeling target around which all learning, calibration, and validation are structured. The focus on a 5-day horizon is deliberate — short enough to reduce regime uncertainty, yet long enough to capture exploitable market patterns for swing traders and options sellers.

Why a Five-Day Horizon?

Empirical research has consistently demonstrated that forecasting accuracy declines as the prediction horizon increases, due to the accumulation of structural noise, regime uncertainty, and external shocks (Koutsandreas et al., 2022). Longer horizons introduce compounding sources of variance that impair the reliability of both model inference and actionable signal extraction.

A five-day horizon represents a calibrated trade-off between statistical fidelity and economic utility. It is sufficiently short to retain alignment with recent market dynamics, yet long enough to accommodate the operational cadence of execution strategies. Modeling a single rolling five-day return probability per asset enables robust adaptation to temporal shifts while preserving generalizability across out-of-sample periods.
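The target variable itself is simple to state. Below is a minimal sketch of how such a label could be constructed, assuming a pandas Series of daily closing prices — the actual pipeline and column conventions are not specified in the text:

```python
import pandas as pd

def five_day_targets(close: pd.Series) -> pd.Series:
    """Binary target: 1 if the close five trading days ahead exceeds today's
    close, else 0; NaN where the forward window is not yet complete."""
    fwd_return = close.shift(-5) / close - 1.0
    return (fwd_return > 0).astype(int).where(fwd_return.notna())
```

Each row thus asks exactly the question posed above: will the 5-day forward return be positive?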

Why Probability — Not Direction

Traditional retail and institutional trading models often aim to classify direction (up/down) or predict a point return (e.g., +1.2%). Both approaches are limited in risk-aligned application:

  • Binary classifiers (0 = down, 1 = up) fail to reflect the confidence behind a signal. Even when probabilistic outputs are available, most binary classifiers used in retail systems are uncalibrated, treating a 51% and a 91% prediction identically.
  • Point estimates (regression models) introduce noise, especially in short-term horizons where standard deviations of return can exceed the mean by a factor of 5 or more.

Instead, we train models to produce a calibrated probability between 0 and 1 that the 5-day return will be positive. This is a probabilistic forecast in the formal sense described by Gneiting & Katzfuss (2014) and Gneiting & Raftery (2007), where each output represents a predictive distribution, not a deterministic guess.

This allows traders to filter opportunities, size positions, and assess risk on a continuous spectrum of confidence, rather than responding to oversimplified binary signals.

Intended Applications

This 5-day probability output is used in two principal ways:

(1) Swing Trades (Common Stock)

The probability forecast is used to define whether to enter a swing trade, its direction, position sizing, and exit rules — all based on empirically derived rules implemented through the proprietary EDTL framework (Entry-Discount-Target-Level).

Entry Logic (Directional Filtering)
The system establishes entry direction using the calibrated 5-day probability output:

  • Long Positions: initiated when the forecast probability exceeds 70%.
  • Short Positions: initiated when the forecast probability falls below 30%.

These thresholds are derived from out-of-sample band testing and reflect the meaningful upper and lower bounds of statistically significant predictive confidence. The model does not provide binary buy/sell signals — it identifies conditions under which historical returns and risk-adjusted metrics have been most favorable for initiating a directional position.
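As a sketch, the directional filter above reduces to a few lines of logic — the thresholds follow the text, while the function name and return values are illustrative:

```python
def entry_signal(prob_5d, long_th=0.70, short_th=0.30):
    """Directional filter from the calibrated 5-day probability.
    Thresholds (70% / 30%) follow the band-tested values in the text."""
    if prob_5d > long_th:
        return "long"
    if prob_5d < short_th:
        return "short"
    return None  # no statistically favorable setup; stay flat
```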

Position Sizing (Discount-Based Structure)
Rather than using a fixed or volatility-based position size, our swing trades use our customized EDTL (Entry–Discount–Target–Level) system, a data-driven adaptation of traditional averaging methods.

  • Each trade is initialized with a level-based structure: L, the number of potential add-on levels if the price declines — that is, the maximum number of add-on lots required to maintain optimal compounding with 95% confidence.
  • The system estimates this ‘L’ using historical simulations of return trajectories — identifying the optimal number of entries (e.g., 3, 5, 8) that maximize XIRR (Extended Internal Rate of Return), not just cumulative gain.
  • These entry levels are defined by price discounts from the initial entry, with intervals selected to optimize compounding and capital deployment efficiency.
  • Position sizing is inversely proportional to the number of potential entries — the more potential adds, the smaller the initial allocation.

This approach balances capital flexibility with statistical evidence, allowing the trader to size for expected compounding rather than just theoretical edge. This is our money management system component.
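A heavily simplified sketch of the level-based sizing idea follows. Equal lots and a fixed discount step are assumptions for illustration only; in the actual EDTL system, the number of levels and the discount intervals are derived from historical XIRR simulations:

```python
def edtl_lots(capital, levels, discount_step=0.05):
    """Illustrative EDTL-style structure: split capital across `levels` equal
    lots (sizing inversely proportional to the number of potential entries),
    with each add-on triggering at a deeper discount from the initial entry.
    The 5% step is a placeholder, not a calibrated value."""
    lot = capital / levels
    triggers = [round(-discount_step * k, 4) for k in range(levels)]  # 0%, -5%, -10%, ...
    return lot, triggers
```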

Exit Logic (Target-Based or Probabilistic Stop)
Exits are governed by one of two rules, both designed to preserve XIRR advantage and model integrity:

  1. Target-Based Exit
    • Each trade includes a predefined percent-return target, calibrated to maximize XIRR within the add-based entry path.
    • These target levels are derived from historical optimization.
  2. Probabilistic Stop-Loss
    • Alternatively, a position is exited early if the 5-day forecast probability drops below a threshold (typically 30–40%) after entry.
    • This dynamic stop reflects a meaningful deterioration in model confidence and signals that the initial thesis has weakened.

This dual-exit system preserves strategic flexibility and quantitative discipline. It avoids arbitrary profit-taking while enabling the trader to exit under controlled statistical deterioration, consistent with Bayesian updating principles and confidence-weighted risk management frameworks (see Berger et al., 2009).
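The two exit rules can be sketched as a single check — the target and stop values shown are illustrative placeholders, not the calibrated production values:

```python
def should_exit(current_return, prob_5d, target=0.08, stop_prob=0.35):
    """Dual-exit check: percent-return target hit, or forecast confidence
    deteriorated below the probabilistic stop (values are placeholders)."""
    if current_return >= target:
        return "target"
    if prob_5d < stop_prob:
        return "probabilistic_stop"
    return None  # hold: thesis intact, target not yet reached
```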

(2) Cash-Covered Put Strategy (Strike Window Selection from Historical Distributions)

In our framework, the five-day forecast probability is used not to determine a strike price directly, but to filter when to participate and to inform the construction of a strike-price window. This system is applied exclusively to short option strategies — we do not buy options — and is most relevant to cash-covered puts (or, analogously, short calls).

The process proceeds in two steps:

Step 1: Participation Filter (Probability Threshold)

Before a trade is considered, the system evaluates whether the forecasted five-day probability exceeds a stock-specific threshold. This method supports both short puts (when probability exceeds 70–75%) and short calls (when probability is below 30–40%), depending on the directional signal from the model.

  • If the forecasted probability is below the threshold, the system excludes the stock for that cycle.
  • If it passes, we generate a range of acceptable strikes — a strike window — based on empirical downside behavior.

Step 2: Strike Window Determination (Dual Historical Anchoring)

Once a stock is deemed eligible, the system identifies two distinct strike prices, both derived from the historical five-day return distributions:

  1. Conservative Strike (Full Historical Distribution – ~2,000 cases):
    • This strike level represents the 10th percentile of historical returns for a specific weekday-to-Friday window (e.g., Monday to Friday, Tuesday to Friday, etc.), rather than a rolling 5-day period. It captures the price below which the stock has closed only 10% of the time across that exact calendar interval since 2008.
    • This approach accounts for day-of-week effects in market behavior by anchoring returns to fixed trading-day sequences rather than generic rolling windows. The resulting distribution incorporates a full range of market regimes—bullish, bearish, and sideways—making it a conservative estimate of downside risk specific to the setup being evaluated.
  2. Contextual Strike (Recent Analogues in Unseen Data – ~800 cases):
    • The second strike reflects the current probability context.
    • From the most recent five years of validation-period data, we extract all five-day windows (~800 cases) and isolate those where the model’s forecasted probability matched the current range (e.g., ≥ 70%).
    • Within this filtered subset, typically yielding 400+ analogues, we identify the bottom 10th percentile.
    • This strike captures how the stock has behaved under similar confidence conditions in recent market environments, offering a more aggressive but empirically grounded alternative.

Together, these two strike levels form a strike window — a data-driven range within which the trader can select a short put position based on their personal risk preference, premium yield requirements, or volatility exposure.

  • The lower (more conservative) strike ensures historical robustness across all regimes.
  • The higher (more context-sensitive) strike leverages recent analogues for potential yield enhancement.

This dual-anchor method reduces curve-fitting risk, avoids overreliance on any single regime, and provides a flexible, data-driven strike selection process. It also ensures that the risk of breach remains historically bounded, with no assumptions beyond the data.
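The two anchors can be sketched as percentile computations over the two return distributions. The 10th percentile and the 70% probability floor follow the text; the array names and exact filtering mechanics are illustrative:

```python
import numpy as np

def strike_window(spot, full_returns, recent_returns, recent_probs,
                  prob_floor=0.70, pct=10):
    """Dual historical anchoring (sketch):
    conservative = 10th percentile of the full 5-day return history;
    contextual   = 10th percentile of recent windows whose forecast matched
                   the current probability range."""
    conservative = spot * (1 + np.percentile(full_returns, pct))
    analogues = recent_returns[recent_probs >= prob_floor]
    contextual = spot * (1 + np.percentile(analogues, pct))
    return conservative, contextual
```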

By combining a calibrated probability filter with empirical downside boundaries, the model enables a repeatable framework for sizing and selecting short put trades — one that aligns with documented best practices in model validation, historical conditioning, and regime awareness (Gneiting & Raftery, 2007).

2 Model Architecture: From Feature Sets to Forecasts

Constructing a statistically valid and practically useful model for short-term stock prediction demands more than algorithm choice. It requires a layered architecture grounded in generalization, out-of-sample validation, and confidence calibration — all designed to produce probability forecasts that remain robust across shifting regimes (Gneiting & Katzfuss, 2014; Niculescu-Mizil & Caruana, 2005; Dietterich, 2000).

Per-Stock Modeling Philosophy

Most retail and AI trading platforms apply a single model across many securities — a method that ignores individual stocks’ non-stationary, idiosyncratic behavior. In contrast, our framework trains a dedicated model ensemble for each stock, allowing it to learn that instrument’s unique volatility regime, feature interactions, and noise characteristics.

Ensemble literature emphasizing local volatility patterns, idiosyncratic structure, and structural non-stationarity supports the superiority of per-stock models over global ones in financial applications (Sonkavde et al., 2023; Abolmakarem et al., 2024).

Multi-Model Ensemble Structure

Each stock’s forecast is built from a layered ensemble of models:

Stage 1: Base Models (8 Total, with future expansion planned for V4)

We generate eight distinct base models per stock: four Random Forest (RF) and four XGBoost (XGB) models, each trained on a distinct feature subset. Ensemble diversity has been shown to improve predictive stability, especially in high-noise domains such as financial time series (Dietterich, 2000; Gneiting & Raftery, 2007; Shrivastav & Kumar, 2022).

Each base model is trained on a unique feature subset, selected for:

  • Low correlation with other models,
  • Emphasis on distinct market factors (e.g., trend, volatility, flow), and
  • Historical performance across Brier Score, AUC, and PPV.

The feature universe consists of approximately 150 indicators, including both widely known technical indicators (such as RSI, MACD, Bollinger Bands, ATR) and proprietary constructs. These indicators span multiple time horizons — from short-term 5-day metrics to annual measures — allowing the model to adapt to different market cycles. Each base model is built using a greedy forward selection process, optimizing performance across Brier Score, AUC, and PPV, resulting in multiple feature sets containing between 7 and 20 features per model.

Research confirms that diverse, indicator-rich inputs — spanning volatility, momentum, and trend constructs — improve performance in both tree-based and hybrid ensemble models (Sonkavde et al., 2023; Abolmakarem et al., 2024; Kumbure et al., 2022).
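The greedy forward selection step can be expressed generically. In the sketch below, `score_fn` stands in for the Brier/AUC/PPV composite used in production — its exact form is an assumption:

```python
def greedy_forward_select(features, score_fn, max_features=20, min_gain=0.0):
    """Greedy forward selection: repeatedly add the feature that most improves
    score_fn(subset); stop when no candidate adds at least min_gain or the
    feature cap is reached."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining and len(selected) < max_features:
        top_score, top_f = max((score_fn(selected + [f]), f) for f in remaining)
        if selected and top_score - best <= min_gain:
            break  # no remaining feature improves the composite score
        selected.append(top_f)
        remaining.remove(top_f)
        best = top_score
    return selected
```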

Stage 2: Stacked Meta-Models (3 Total)

To consolidate insights from the base models, we build 3 stacked meta-models:

  • A logistic regression (GLM) stack,
  • A Random Forest stack, and
  • An XGBoost stack.

Each meta-model is trained exclusively on the out-of-sample, calibrated probability outputs from the eight base models, ensuring a clean separation between raw features and higher-order ensemble learning. This design reinforces probabilistic integrity while preserving temporal structure. Validation is performed using k-fold cross-validation with chronological splits, mitigating leakage and preserving forward-facing generalization.

By leveraging ensemble diversity at the meta-model level, the stacked architecture further reduces variance and enhances robustness across shifting market regimes (Dietterich, 2000; Niculescu-Mizil & Caruana, 2005).
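As an illustration of the stacking idea, here is a tiny logistic (GLM-style) meta-model trained on base-model probability outputs — a from-scratch stand-in, not the production implementation, which uses separately validated GLM, RF, and XGB stacks:

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_glm_stack(base_probs, y, lr=0.5, steps=2000):
    """Gradient-descent logistic regression whose inputs are the base models'
    out-of-sample probabilities (plus an intercept) — the GLM stack in miniature."""
    X = np.column_stack([base_probs, np.ones(len(y))])
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (_sigmoid(X @ w) - y) / len(y)
    return w

def stack_predict(w, base_probs):
    X = np.column_stack([base_probs, np.ones(len(base_probs))])
    return _sigmoid(X @ w)
```

The key design point, as in the text, is that the stack never sees raw features — only validated base-model probabilities.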

Stage 3: Weighted Averaging

A performance-weighted ensemble is applied to the calibrated stacked models, where weights are optimized based on Brier Score (calibration), AUC (discrimination), and Precision (edge quality) — a framework aligned with best practices in scoring rule–based model selection (Gneiting & Raftery, 2007; Koutsandreas et al., 2022).

This final prediction becomes each stock’s calibrated consensus probability forecast over the next five days.

Feature Sets and Diversity Design

This diversity in feature construction — spanning trend, volatility, momentum, and regime-awareness — increases ensemble robustness. Each model uses a distinct subset (7–20 features) drawn from the broader 150-feature universe, minimizing overlap and overfitting.

Full details on model calibration, validation design, and metric integration are presented in the following sections.

Summary Architecture Flow

Here is a simplified flow of the architecture for each stock we follow:

  1. Raw Data → Feature Engineering → Stock-Specific Dataset
  2. 8 Base Models (RF/XGB, diverse features)
  3. Out-of-Sample Calibrated Predictions from Base Models
  4. 3 Stacked Meta-Models (GLM, RF, XGB)
  5. Performance-Weighted Calibrated Ensemble from Calibrated Stacked Model Predictions  
  6. Final 5-Day Probability Forecast

This modular design — from feature-specific base models to calibrated stacking and weighted consensus — yields a probability forecast that is statistically interpretable, strategy-aligned, and resistant to regime-induced degradation.

3 Model Selection and Ensemble Weighting Logic

A key strength of our framework lies not only in building accurate models, but in selecting and combining them through a structured, metric-driven process that maximizes reliability and minimizes overconfidence. Statistical learning theory widely supports ensemble learning for its ability to reduce variance, improve generalization, and enhance predictive stability, particularly in noisy, non-stationary domains like financial markets (Dietterich, 2000; Kumar et al., 2023).

Our multi-layered selection process incorporates performance-based filtering and calibrated weighting at each modeling tier, from base models to final ensemble outputs.

Base Model Selection

Each stock begins with a universe of over 150 engineered and technical indicators. A greedy forward selection process generates 150 candidate feature sets from this pool. Based on out-of-sample validation performance, the top eight are retained—four for Random Forest (RF) and four for XGBoost (XGB) models.

Each model is evaluated using:

  • Brier Score (for probability calibration)
  • AUC-ROC (for ranking discrimination)
  • Precision (PPV) (for decision-level quality)
  • Error entropy metrics (for stability)
  • Calibration plots (for reliability assessment)

Models are only accepted if they produce non-correlated, high-confidence probability forecasts and are then passed through calibration using either Platt Scaling or Beta Calibration, depending on the shape and spread of the predicted probabilities (Niculescu-Mizil & Caruana, 2005; Guo et al., 2017).

Stacked Model Inclusion

Stacked models—one each using GLM, RF, and XGB—are trained exclusively on the calibrated probability outputs of the base models. This design separates raw features from the ensemble decision layer, limiting overfitting and improving transparency. Each stacked model undergoes:

  • Time-aware k-fold cross-validation
  • Post-training calibration
  • Evaluation using Brier Score, AUC, Precision, and Expected Calibration Error (ECE)

To be retained, a stacked model must outperform its calibrated inputs on at least two of these metrics, ensuring each ensemble layer contributes distinct value and not just computational redundancy.

Weighted Averaging of Calibrated Stack Predictions

The final probability forecast for each stock is a weighted combination of the three calibrated stacked model outputs. These weights are stock-specific and derived from each model’s out-of-sample performance:

  • Brier Score → Reflects calibration accuracy
  • AUC → Captures discrimination power
  • Precision → Indicates actionability at high thresholds

Weighting formula: each stacked model receives a weight derived from a normalized composite of its out-of-sample Brier Score, AUC, and Precision.

This approach ensures that no model dominates based on a single strength. Only well-calibrated, discriminative, and precise models contribute meaningfully to the final ensemble. The resulting forecast is a statistically interpretable consensus probability over the next five trading days.
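The text does not reproduce the exact weighting formula, so the following is one plausible composite consistent with the description — rewarding low Brier Score and high AUC/Precision, then normalizing so the weights sum to one. It is an illustrative assumption, not the production formula:

```python
def ensemble_weights(brier, auc, precision):
    """One plausible per-model weight: (1 - Brier) * AUC * Precision, normalized.
    The actual production composite is not specified in the text."""
    raw = [(1.0 - b) * a * p for b, a, p in zip(brier, auc, precision)]
    total = sum(raw)
    return [r / total for r in raw]
```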

Summary: Ensemble by Design, Not Default

Unlike many retail or commercial platforms that apply ensemble methods haphazardly or opaquely, our architecture reflects disciplined ensemble logic:

  • Base models use non-overlapping feature sets and calibrated outputs.
  • Stacks are validated for added value, not complexity.
  • Final ensembles are weighted by multidimensional performance, not popularity or type.

As highlighted in ensemble studies across finance and ML (Sonkavde et al., 2023; Shrivastav & Kumar, 2022; Gneiting & Raftery, 2007), accuracy alone is insufficient. True robustness requires diverse base learners, post-hoc calibration, and evaluation frameworks grounded in proper scoring rules and decision-relevant metrics.

4 Time-Aware Validation: Real-World Reliability

In financial time series, chronological integrity is essential for trustworthy evaluation. Yet many machine learning studies rely on random or stratified cross-validation techniques that obscure forward-looking performance, introducing temporal leakage and overfitting risks. Our approach avoids these pitfalls by enforcing strict time-aware validation, layered calibration, and metric-based model retention, reflecting best practices in statistical forecasting and real-world trading system design.

Time-Aware Validation Design

Our validation framework uses an 80/20 chronological split:

  • The first 80% of historical data is used exclusively for training (~3200 samples).
  • The final 20% is held out for accurate out-of-sample evaluation (~800 samples).

This design simulates real-world deployment, where decisions must be made without knowledge of future data. Each model is trained only on data preceding the prediction window, ensuring that no lookahead bias contaminates model behavior. This structure aligns with the forward-testing rigor advocated in the forecast evaluation literature (Gneiting & Raftery, 2007; Koutsandreas et al., 2022).
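The split itself is deliberately simple — a positional cut in time order, never a shuffle:

```python
def chrono_split(rows, train_frac=0.8):
    """80/20 chronological split: everything before the cut trains the model;
    everything after is held out. No shuffling, so no future rows leak back."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]
```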

Importantly, our stacked meta-models are trained only on out-of-sample calibrated predictions from the base models. This preserves a strict hierarchy: The base models forecast independently, and only their validated outputs feed into the stack. No retraining occurs at the stack level using in-sample predictions, eliminating a common form of cross-validation leakage seen in other ensemble workflows.

Calibration is especially important for tree-based models like Random Forest and gradient boosting (XGBoost), which are known to produce overconfident probability estimates at leaf nodes (Zadrozny & Elkan, 2001). Without post-hoc correction, these models can rank predictions well (high AUC) while still producing misleading confidence scores.

Empirical Integrity: Clean Data Boundaries

Our architecture strictly separates training, validation, and production phases. Base and stacked models are never trained on their evaluation data. No stacked model uses in-sample predictions, and the ensemble is recalibrated after weighted averaging. This structure prevents leakage and ensures all performance metrics reflect true out-of-sample behavior, not backfit optimization.

Calibration of Probabilities

Machine learning models do not inherently produce well-calibrated probabilities, especially in domains characterized by high noise, class imbalance, and structural breaks like finance. To correct for this, we apply formal calibration methods to both base and stacked model outputs:

  • Platt Scaling (logistic regression–based transformation)
  • Beta Calibration (distributional warping using a beta function)

The calibration method is selected individually for each model by choosing the method that minimizes the Brier Score (RF models) or LogLoss (XGBoost models) on the validation set. This strategy is supported by empirical findings that show no one-size-fits-all solution to calibration (Niculescu-Mizil & Caruana, 2005; Guo et al., 2017). It also reflects Gneiting & Raftery’s (2007) principle that calibration is necessary for forecast trustworthiness, and that only strictly proper scoring rules can incentivize honest probability estimation.
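The per-model selection step reduces to scoring each candidate calibrator's validation-set output and keeping the minimizer. A sketch with the Brier Score (the text uses LogLoss for the XGBoost models); Beta calibration itself is not implemented here:

```python
import numpy as np

def brier(p, y):
    return float(np.mean((p - y) ** 2))

def pick_calibrator(calibrated, y_val, metric=brier):
    """`calibrated` maps method name (e.g. 'platt', 'beta') to its validation-set
    probabilities; return the name of the method minimizing the score."""
    return min(calibrated, key=lambda name: metric(calibrated[name], y_val))
```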

5 Performance Metrics and Probability Bands

Robust probabilistic forecasting requires more than high classification accuracy; it requires proper evaluation of uncertainty, sharpness, and decision relevance. Our framework integrates statistical scoring rules and financial decision metrics to assess models holistically, ensuring they are both probabilistically sound and economically actionable.

Model Evaluation Metrics

Each model—base, stacked, and ensemble—is evaluated using a suite of strictly proper scoring rules (Gneiting & Raftery, 2007; Gneiting & Katzfuss, 2014) and additional performance measures:

  • Brier Score
    A mean squared error between predicted probabilities and observed outcomes. Lower values indicate better calibration.
  • Logarithmic Loss (LogLoss)
    Penalizes overconfident errors, encouraging conservative probabilistic predictions. Sensitive to sharpness and miscalibration.
  • AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
    Measures a model’s ability to discriminate between classes across all thresholds.
  • Precision (Positive Predictive Value)
    Assesses trustworthiness of high-probability forecasts, reflecting decision quality under asymmetric payoff conditions.

Additionally, we track:

  • Recall, F1 Score
  • Null Accuracy and Out-of-Bag Error (for RF models)
  • Entropy-based measures of prediction stability (row-wise and distributional entropy)

All metrics are computed on strictly held-out validation data to preserve out-of-sample integrity and reflect real-world deployment performance.
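For reference, the three core scores can be computed without any ML framework (a rank-based AUC is shown; standard library equivalents exist in scikit-learn):

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return float(np.mean((p - y) ** 2))

def log_loss(p, y, eps=1e-12):
    """Penalizes overconfident errors; clipping avoids log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def auc(p, y):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos, neg = p[y == 1], p[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))
```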

Probability Bands: Turning Forecasts into Strategy Filters

While raw probabilities offer insight, they are not inherently actionable. To bridge the gap between statistical signal and trading execution, we construct empirical probability bands, which group forecasts into tiers with distinct historical performance characteristics.

Band Design

Forecasted probabilities are segmented into empirically defined bands — e.g.:

  • 0–30%, 30–45%, 45–55%, 55–65%, 65–75%, 75–100%

These bands are calibrated to balance:

  • Forecast granularity (sufficient resolution across probability ranges)
  • Statistical reliability (enough data points per band to support analysis)
  • Strategic alignment (clear translation to trade/no-trade decisions)

For each band, we compute a suite of financial and statistical performance metrics using validation data:

  • Win Rate: Percentage of trades that achieved a positive return
  • Sharpe Ratio: Risk-adjusted return, penalizing volatility
  • Sortino Ratio: Focused on downside deviation
  • Maximum Drawdown: Measures capital risk exposure
  • Profit Factor: Ratio of gross gains to gross losses
  • Median 5-Day Return: Typical short-term outcome

This banding system transforms model outputs into an interpretable signal layer supporting discretionary and systematic strategies. It also empowers investors to set thresholds aligned with their risk preferences and execution constraints.
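A sketch of the banding computation, using the band edges listed above; only win rate and median return are shown, but the remaining metrics (Sharpe, drawdown, profit factor) follow the same per-band pattern:

```python
import numpy as np

def band_stats(probs, returns, edges=(0.0, 0.30, 0.45, 0.55, 0.65, 0.75, 1.0)):
    """Group validation forecasts into probability bands and compute per-band
    win rate and median 5-day return."""
    stats = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            r = returns[mask]
            stats[(lo, hi)] = {"n": int(mask.sum()),
                               "win_rate": float((r > 0).mean()),
                               "median_ret": float(np.median(r))}
    return stats
```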

Use in Strategy Selection

These probability bands serve as more than statistical groupings — they form the execution layer that informs tactical decision-making. Each band is associated with historical return characteristics, enabling differentiated strategy deployment such as:

  • Swing trades: Using band-level win rates and Sharpe ratios to determine entry conviction and target sizing.
  • Short puts or calls: Filtering for directional bias and identifying favorable strike windows.
  • Confidence-based position sizing: Scaling exposure based on the model’s calibrated probability.

For example:

  • A forecast of 73% may fall into a band historically delivering a 3.1 Sharpe ratio and a 71.5% win rate, supporting a long swing or short put.
  • A forecast of 32% may correspond to consistent downside movement, offering a high-confidence short-call opportunity.

This structure supports strategy selection without reducing model output to binary rules, preserving probabilistic nuance while informing trade logic.

Band Stability and Data Sufficiency

To ensure statistical robustness, we impose strict minimums on data volume and band granularity:

  • A minimum of 2,000–3,000 five-day windows is required to stabilize calibration metrics (e.g., Brier, AUC).
  • Bands are only constructed for stocks with at least ten years of data, ensuring reliable calibration and validation.
  • Each band must contain at least 100 trade instances (typically 130–150+) to produce stable win rate, drawdown, and Sharpe estimates.

To maximize resolution, we use overlapping 5-day return windows (e.g., Mon–Fri, Tue–Mon, etc.), and each stock is evaluated independently, acknowledging differences in volatility regime and return symmetry across tickers.

6 Reliability, Entropy, and Confidence

Understanding a model’s predictions requires more than just knowing its performance metrics. Traders need visibility into how confident a model is, how well calibrated that confidence is, and how stable its behavior is across time. To support this, we track three diagnostic layers for every model and every forecast:

  • Reliability (calibration curves)
  • Entropy (confidence dispersion)
  • Normalized Confidence Score (decision strength)

These diagnostics are logged for each model and help ensure trust, traceability, and robustness, especially in live environments where overconfidence or regime shifts can impair performance without apparent symptoms.

Reliability Curves and Calibration Plots

  • A reliability curve (calibration plot) compares forecasted probabilities to observed win rates within specified bins. In a perfectly calibrated model, predictions in the 70–75% bin would realize success 70–75% of the time; deviations from the diagonal indicate underconfidence or overconfidence, both common in uncalibrated ensemble outputs.
  • To evaluate calibration across all modeling layers (base, stack, ensemble), we generate:
    • Equal-width bin plots (e.g., 10 uniform bins of 0.10 width)
    • Quantile bin plots (e.g., bins with equal numbers of observations)

These plots clearly show how calibration improves after Platt or Beta correction, how stacked models retain or improve base model reliability, and how the final weighted ensemble preserves probabilistic trustworthiness (Gneiting & Raftery, 2007; Guo et al., 2017).

Entropy: Measuring Prediction Distribution Stability

Entropy provides a complementary perspective on model behavior, measuring how dispersed or concentrated prediction probabilities are.

  • Row-wise Entropy: Captures confidence at the individual prediction level. Predictions near 0.5 exhibit high entropy (uncertainty); those near 0 or 1 exhibit low entropy (high certainty).
  • Distributional Entropy: Measures how predictions are spread across the entire test set. A balanced, bell-shaped distribution indicates healthy variation. A bimodal distribution suggests extreme confidence and potential overfitting.

We compute and monitor both entropy types to identify:

  • Overconfidence (tight clustering near 0 or 1),
  • Underuse of the central range (suggesting model aversion to uncertainty),
  • Regime drift or behavioral instability.

Entropy provides early warnings even when traditional metrics like AUC or Brier Score remain steady.
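Row-wise entropy for a binary forecast is the standard Bernoulli entropy — maximal at p = 0.5 and approaching zero near certainty:

```python
import numpy as np

def row_entropy(p, eps=1e-12):
    """Binary entropy per prediction, in bits: 1.0 at p=0.5, ~0 near 0 or 1.
    Clipping avoids log(0) at the extremes."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
```

Monitoring the distribution of these values across the test set gives the distributional-entropy view described above.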

Confidence Score: An Interpretable Signal

To support decision-making, we calculate a normalized confidence score for each prediction. This value integrates:

  • Distance from 0.5 (neutral probability)
  • Band-level performance history (win rate, Sharpe)
  • Stacked model agreement (consensus strength)

The result is a rankable confidence measure between 0 and 1 that enables:

  • Prioritization of trades
  • Dynamic capital scaling
  • Filtering of marginal signals in high-volume portfolios

This score is stored alongside each prediction and can be used in production settings to define exposure rules and trade eligibility filters.
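The text lists the score's inputs but not its functional form; the sketch below is one plausible equal-weighted normalization and should be read as an assumption, not the production definition:

```python
def confidence_score(prob, band_win_rate, stack_agreement):
    """Illustrative composite of the three listed components, each in [0, 1]:
    distance from neutral probability, band-level win-rate history, and
    stacked-model agreement. Equal weighting is an assumption."""
    distance = abs(prob - 0.5) / 0.5  # 0 at p=0.5, 1 at full certainty
    return (distance + band_win_rate + stack_agreement) / 3.0
```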

Integrated Diagnostics for Live Monitoring

All three diagnostic layers—reliability, entropy, and confidence—are embedded into the validation pipeline and saved with each model’s metadata. This enables:

  • Ongoing behavioral monitoring
  • Real-time alerts for degraded calibration or shifting confidence
  • Auditability of every model prediction and ensemble outcome

This transparency is essential for avoiding black-box behavior. Unlike many opaque systems that treat model output as final, our diagnostic suite surfaces model behavior and empowers users to understand, trust, or override predictions when necessary.

 

7 Production Model Workflow and Live Updates

Developing a robust model is only the beginning. In financial markets, the true test lies in operationalizing machine learning systems — maintaining predictive accuracy, calibration integrity, and traceability over time. Our production framework is designed to meet these real-world challenges through a disciplined, version-controlled deployment pipeline and structured update process.

Retraining Base Models on Full Data

Before deployment, each selected base model is retrained on 100% of the available historical data. This allows the final model to:

  • Incorporate the most recent regime information,
  • Retain statistical consistency with prior validation results,
  • Leverage all training data without introducing forward bias.

This final retraining step adheres to best practices in applied ML, where hyperparameters and architecture are fixed prior to full-data fitting (Dietterich, 2000; Brownlee, 2017).

Stacked models, trained using cross-validated base outputs, inherently reflect the whole dataset and do not require separate retraining after final calibration.

Daily Prediction Pipeline

Once calibrated models are in production, a fully automated pipeline generates daily five-day probability forecasts for each covered stock. The pipeline executes:

  • Up-to-date feature engineering,
  • Prediction from all eight calibrated base models,
  • Aggregation through the three calibrated stacked models,
  • Final output via performance-weighted ensemble.

Each forecast is accompanied by its probability band and associated historical return metrics. This structure ensures that every prediction is verifiable, interpretable, and historically auditable.
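The four pipeline stages above can be sketched as a single pass of matrix operations (toy stand-in models shown for illustration; the production system uses eight calibrated base models and three stacked models, and all names here are hypothetical):

```python
import numpy as np

def run_daily_pipeline(features, base_models, stacked_models, ens_weights):
    """Daily forecast flow: base models -> stacked models -> weighted ensemble.

    features       : (n_stocks, n_features) matrix from the day's feature engineering
    base_models    : calibrated base models as probability-returning callables
    stacked_models : calibrated stacked models taking base outputs as inputs
    ens_weights    : performance-derived weights over stacked models (sum to 1)
    """
    # 1. Base layer: one probability column per base model
    base_out = np.column_stack([m(features) for m in base_models])
    # 2. Stacking layer: each stacked model aggregates the base probabilities
    stacked_out = np.column_stack([s(base_out) for s in stacked_models])
    # 3. Final performance-weighted ensemble
    return stacked_out @ np.asarray(ens_weights)

# Toy stand-ins: logistic base models with shifted biases, two simple aggregators
base_models = [lambda X, b=b: 1 / (1 + np.exp(-(X.mean(axis=1) + b)))
               for b in (-0.1, 0.0, 0.1)]
stacked_models = [lambda B: B.mean(axis=1), lambda B: B.max(axis=1)]
weights = [0.6, 0.4]

X = np.zeros((4, 5))
out = run_daily_pipeline(X, base_models, stacked_models, weights)
print(out)
```

In production, each element of `out` would then be mapped to its probability band and stored with the historical return metrics for that band.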

Live Monitoring and Model Evaluation

To maintain real-time reliability, we monitor model performance continuously. Logged forecasts are compared to actual five-day returns, and each model is evaluated periodically for:

  • Calibration drift (based on Brier Score and ECE),
  • Confidence instability (via entropy metrics),
  • Band-level underperformance (e.g., drop in Sharpe or win rate).

When degradation is detected, models are flagged for review or early retraining, ensuring responsiveness to market regime shifts and feature decay.
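The calibration-drift checks above rest on two standard quantities, the Brier Score and Expected Calibration Error (ECE); a minimal implementation (the drift thresholds themselves are system-specific and not specified here) looks like this:

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between forecast probabilities and binary outcomes."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))

def expected_calibration_error(p, y, n_bins=10):
    """ECE: frequency-weighted gap between mean forecast and realized hit rate per bin."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return float(ece)

# Toy check: a constant 0.5 forecast on a 50/50 outcome stream is perfectly
# calibrated (ECE = 0) but uninformative (Brier = 0.25)
p = np.full(100, 0.5)
y = np.array([0, 1] * 50)
print(brier_score(p, y), expected_calibration_error(p, y))
```

Comparing these metrics on a rolling window of logged forecasts against their validation-time baselines is one way the "calibration drift" flag described above can be operationalized.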

8 Summary and Closing Insights

This paper outlines a complete, research-based architecture for probabilistic machine learning in short-term stock forecasting. Our goal was not to present a generic model but to detail how a disciplined, transparent, and calibrated ML system can consistently translate uncertainty into usable probability, empowering better decision-making for both retail and professional investors.

Where most systems rely on static rules or categorical signals, this framework offers:

  • Stock-specific modeling, tuned to each asset’s volatility regime, behavioral pattern, and historical structure.
  • Probabilistic forecasts, enabling confidence-weighted decisions rather than binary guesses.
  • Calibration at every stage, ensuring that a 70% forecast is empirically associated with a 70% success rate.
  • Rigorous validation, using proper scoring rules like Brier Score and LogLoss, alongside financial diagnostics including Sharpe, Sortino, and Profit Factor.
  • Empirical probability bands, transforming abstract probabilities into context-aware trade filters for swing trades, short puts, and directional call strategies.

Technically, the model is constructed as a three-tier ensemble system, integrating:

  • Eight base models (Random Forest and XGBoost) trained on diverse, non-overlapping feature sets;
  • Three stacked models (GLM, RF, XGB) trained solely on calibrated out-of-sample predictions;
  • A final performance-weighted ensemble of stacked predictions, evaluated across multiple calibration and decision quality dimensions.

These forecasts are validated using a time-aware 80/20 chronological split, assessed with reliability plots and entropy diagnostics, and embedded with confidence scores to support real-world prioritization, scaling, and filtering. Predictions are calibrated and monitored in live deployment, with refreshes based on market regime shifts or evidence of performance decay.

Final Takeaway

In trading, uncertainty is inescapable — but it does not have to be ignored. The power of machine learning lies not in prediction, but in quantifying likelihood with measurable accuracy. A probabilistic forecast does not guarantee an outcome — it estimates how often similar outcomes have occurred in the past, under similar conditions. That difference is profound. It changes how you trade, how you size, and how you manage expectations.

This paper forms part of a larger research and application series. The next white paper will demonstrate how to apply this probabilistic framework within a concrete strategy: the 90% Cash-Covered Put System, where confidence meets income generation through systematically chosen, historically bounded strike windows.

Until then, we encourage all traders to demand the same transparency and calibration from any model they use.

Fortune’s winning formula: Tip the scales in your favor with probability-driven, evidence-based trading strategies!

James Krider, MD

9 Glossary

10 References

Abolmakarem, S., Abdi, F., Khalili-Damghani, K., & Didehkhani, H. (2024). A multi-stage machine learning approach for stock price prediction: Engineered and derivative indices. *Intelligent Systems with Applications, 24*, 200449. https://doi.org/10.1016/j.iswa.2024.200449

Ayyildiz, N., & Iskenderoglu, O. (2024). How effective is machine learning in stock market predictions? *Heliyon, 10*, e24123. https://doi.org/10.1016/j.heliyon.2024.e24123

Berger, J. O., Bernardo, J. M., & Sun, D. (2009). The formal definition of reference priors. *The Annals of Statistics, 37*(2), 905–938. https://doi.org/10.1214/07-AOS587

Dietterich, T. G. (2000). Ensemble methods in machine learning. In *Multiple Classifier Systems: First International Workshop, MCS 2000 Cagliari, Italy, June 21–23, 2000 Proceedings* (pp. 1–15). Springer. https://doi.org/10.1007/3-540-45014-9_1

Gneiting, T., & Katzfuss, M. (2014). Probabilistic forecasting. *Annual Review of Statistics and Its Application, 1*, 125–151. https://doi.org/10.1146/annurev-statistics-062713-085831

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. *Journal of the American Statistical Association, 102*(477), 359–378. https://doi.org/10.1198/016214506000001437

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In *Proceedings of the 34th International Conference on Machine Learning (ICML)* (pp. 1321–1330). PMLR.

Koutsandreas, D., Spiliotis, E., Petropoulos, F., & Assimakopoulos, V. (2022). On the selection of forecasting accuracy measures. *Journal of the Operational Research Society, 73*(5), 937–954. https://doi.org/10.1080/01605682.2021.1892464

Kumbure, M. M., Lohrmann, C., Luukka, P., & Porras, J. (2022). Machine learning techniques and data for stock market forecasting: A literature review. *Expert Systems with Applications, 197*, 116659. https://doi.org/10.1016/j.eswa.2022.116659

Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. In *Proceedings of the 22nd International Conference on Machine Learning (ICML)* (pp. 625–632). ACM. https://doi.org/10.1145/1102351.1102430

Shrivastav, L. K., & Kumar, R. (2022). An ensemble of random forest gradient boosting machine and deep learning methods for stock price prediction. *Journal of Information Technology Research, 15*(1), 1–21. https://doi.org/10.4018/JITR.2022010102

Sonkavde, G., Dharrao, D. S., Bongale, A. M., Deokate, S. T., Doreswamy, D., & Bhat, S. K. (2023). Forecasting stock market prices using machine learning and deep learning models: A systematic review, performance analysis and discussion of implications. *International Journal of Financial Studies, 11*(3), 94. https://doi.org/10.3390/ijfs11030094

Tasnim, S. A., Mahmud, R., Sarker, P., Sayed, A., Siddique, A. B., & Apu, A. S. (2024). A comparative review on stock market prediction using artificial intelligence. *Malaysian Journal of Science and Advanced Technology, 4*(4), 383–404. https://doi.org/10.56532/mjsat.v4i4.316

Tran, P., Pham, T. K. A., Phan, H. T., & Nguyen, C. V. (2024). Applying machine learning algorithms to predict the stock price trend in the stock market – The case of Vietnam. *Humanities and Social Sciences Communications, 11*, 393. https://doi.org/10.1057/s41599-024-02807-x

Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In *Proceedings of the 18th International Conference on Machine Learning (ICML 2001)* (pp. 609–616). Morgan Kaufmann.