Temporal Fusion Transformers for Multi-Horizon Portfolio Forecasting
The biggest frustration in applying deep learning to portfolio management is the black-box problem. You train a neural network that produces impressive Sharpe ratios in backtests, but when the strategy starts losing money in production, you have no idea why. Which input features drove the forecast? Which time horizons matter? Was the model reacting to macro signals or microstructure noise?
Google Research addressed this problem directly with the Temporal Fusion Transformer (TFT) -- an attention-based architecture that delivers strong multi-horizon forecasting while providing built-in interpretability through attention weights and variable importance scores. In the original benchmarks, TFT cut quantile loss by 7% (P50) and 9% (P90) on average versus the next-best model, and improved on the strongest competing baseline by 3-26% across the test datasets. For quantitative portfolio managers, TFT offers something rare: a model that is both powerful and transparent.
This post dissects the TFT architecture, shows how to apply it to multi-horizon portfolio forecasting, walks through a Python implementation with pytorch-forecasting, and compares it against LSTM and vanilla transformer baselines.
Why Multi-Horizon Forecasting Matters for Portfolios
Traditional single-step forecasting predicts one value at time $t+1$. But portfolio allocation decisions operate across multiple horizons simultaneously. A portfolio manager needs to know:
- 1-day horizon: For intraday rebalancing and risk management
- 1-week horizon: For tactical allocation shifts
- 1-month horizon: For strategic allocation and sector rotation
- 1-quarter horizon: For macro-driven positioning
Each horizon has different signal-to-noise characteristics. Short-term returns are dominated by microstructure and order flow. Medium-term returns respond to momentum and mean-reversion. Long-term returns are driven by fundamentals and macro regimes.
A model that produces forecasts across all these horizons simultaneously -- with quantile estimates of forecast uncertainty at each -- is fundamentally more useful than separate models for each horizon. This is exactly what TFT provides: given an encoder window of past observations, it outputs a set of quantile forecasts for each of the next $\tau_{\max}$ steps.
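Formally, given a multivariate time series with observations $(y_{i,t}, \mathbf{x}_{i,t}, \mathbf{s}_i)$ for entity $i$ at time $t$, where $y_{i,t}$ is the target, $\mathbf{x}_{i,t}$ are time-varying features, and $\mathbf{s}_i$ are static metadata, TFT learns:
$$\hat{y}_i(q, t, \tau) = f_q\left(\tau,\; y_{i,t-k:t},\; \mathbf{x}_{i,t-k:t+\tau},\; \mathbf{s}_i\right)$$
where $q$ is the quantile (enabling probabilistic forecasts), $k$ is the lookback window, and $\tau \in \{1, \dots, \tau_{\max}\}$ is the forecast horizon. Features that are only observed enter up to time $t$; features known in advance extend through $t+\tau$.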
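To make the indexing concrete, here is a minimal sketch of how a single series is sliced into lookback and horizon pairs. The function name `make_windows` and the toy series are illustrative assumptions rather than any library's API; the pytorch-forecasting dataset used later in this post does this slicing internally.

```python
import numpy as np

def make_windows(y, k, tau_max):
    """Slice a 1-D series into (encoder, target) pairs.

    enc[i] holds k consecutive observations ending at a forecast origin;
    tgt[i] holds the tau_max values that follow it, one per horizon.
    """
    y = np.asarray(y, dtype=float)
    n = len(y) - k - tau_max + 1  # number of complete windows
    if n <= 0:
        raise ValueError("series too short for the requested k and tau_max")
    enc = np.stack([y[i : i + k] for i in range(n)])
    tgt = np.stack([y[i + k : i + k + tau_max] for i in range(n)])
    return enc, tgt

returns = np.arange(10, dtype=float)  # stand-in for a return series
enc, tgt = make_windows(returns, k=4, tau_max=3)
print(enc.shape, tgt.shape)  # (4, 4) (4, 3)
print(enc[0], tgt[0])        # [0. 1. 2. 3.] [4. 5. 6.]
```

Each row of `tgt` is what a direct multi-horizon model predicts in one shot, as opposed to rolling a one-step model forward three times.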
TFT Architecture: The Full Stack

TFT is not a generic transformer slapped onto time series. It is a purpose-built architecture with five specialized components, each addressing a specific challenge in temporal forecasting.
1. Gated Residual Networks (GRN)
The GRN is the fundamental building block of TFT, used throughout the architecture wherever nonlinear processing is needed. Unlike a standard feed-forward layer, the GRN incorporates skip connections and gating mechanisms that allow the network to adaptively control information flow.
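Given a primary input $\mathbf{a}$ and an optional context vector $\mathbf{c}$, the GRN computes:
$$\text{GRN}_\omega(\mathbf{a}, \mathbf{c}) = \text{LayerNorm}\left(\mathbf{a} + \text{GLU}_\omega(\boldsymbol{\eta}_1)\right)$$
where:
$$\boldsymbol{\eta}_1 = \mathbf{W}_{1,\omega}\,\boldsymbol{\eta}_2 + \mathbf{b}_{1,\omega}, \qquad \boldsymbol{\eta}_2 = \text{ELU}\left(\mathbf{W}_{2,\omega}\,\mathbf{a} + \mathbf{W}_{3,\omega}\,\mathbf{c} + \mathbf{b}_{2,\omega}\right)$$
The Gated Linear Unit (GLU) is the key gating mechanism:
$$\text{GLU}_\omega(\boldsymbol{\gamma}) = \sigma\left(\mathbf{W}_{4,\omega}\,\boldsymbol{\gamma} + \mathbf{b}_{4,\omega}\right) \odot \left(\mathbf{W}_{5,\omega}\,\boldsymbol{\gamma} + \mathbf{b}_{5,\omega}\right)$$
where $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication. The sigmoid gate learns which dimensions of the input to suppress, allowing the model to skip unnecessary nonlinear processing entirely when the data does not require it. This is critical for financial data where some features are informative only in certain regimes.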
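The gated residual computation can be sketched in a few lines of NumPy. This is a forward pass with random, untrained weights, assuming the standard TFT formulation (ELU layer, linear layer, GLU, residual add, layer norm); dropout and the learnable layer-norm scale and shift are left out, and all names (`init_grn`, `grn`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init_grn(d, d_ctx):
    """Random weights for one GRN (illustration only, nothing is trained)."""
    w = lambda m, n: rng.normal(scale=0.1, size=(m, n))
    return {"W1": w(d, d), "b1": np.zeros(d),
            "W2": w(d, d), "W3": w(d_ctx, d), "b2": np.zeros(d),
            "W4": w(d, d), "b4": np.zeros(d),
            "W5": w(d, d), "b5": np.zeros(d)}

def grn(a, c, p):
    eta2 = elu(a @ p["W2"] + c @ p["W3"] + p["b2"])   # nonlinear layer with context
    eta1 = eta2 @ p["W1"] + p["b1"]                   # linear layer
    gate = sigmoid(eta1 @ p["W4"] + p["b4"])          # GLU gate in (0, 1)
    glu = gate * (eta1 @ p["W5"] + p["b5"])
    return layer_norm(a + glu), gate                  # residual + layer norm

d, d_ctx = 8, 4
params = init_grn(d, d_ctx)
a = rng.normal(size=(3, d))
c = rng.normal(size=(3, d_ctx))
out, gate = grn(a, c, params)

# With the gate forced shut, the block reduces to a normalized skip connection.
closed = dict(params, b4=np.full(d, -50.0))
out_closed, _ = grn(a, c, closed)
print(np.allclose(out_closed, layer_norm(a), atol=1e-6))  # True
```

The last three lines show the "skip" behaviour described above: when the gate saturates near zero, the nonlinear branch contributes nothing and the input passes through unchanged apart from normalization.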
2. Variable Selection Networks (VSN)
This is arguably the most valuable component for portfolio applications. Financial datasets are notoriously noisy, with dozens of potential features -- technical indicators, fundamental ratios, macro variables, sentiment scores -- many of which are redundant or irrelevant at any given time.
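The VSN computes soft selection weights over all input features:
$$\mathbf{v}_{\chi_t} = \text{Softmax}\left(\text{GRN}_{v_\chi}(\boldsymbol{\Xi}_t, \mathbf{c}_s)\right)$$
where $\boldsymbol{\Xi}_t$ is the flattened vector of all transformed input features at time $t$, and $\mathbf{c}_s$ is a context vector derived from static metadata. Each individual feature $j$ is also processed through its own GRN:
$$\tilde{\boldsymbol{\xi}}_t^{(j)} = \text{GRN}_{\tilde{\xi}(j)}\left(\boldsymbol{\xi}_t^{(j)}\right)$$
The final selected representation is the weighted sum:
$$\tilde{\boldsymbol{\xi}}_t = \sum_{j=1}^{m_\chi} v_{\chi_t}^{(j)}\,\tilde{\boldsymbol{\xi}}_t^{(j)}$$
In a portfolio context, $v_{\chi_t}^{(j)}$ directly tells you: "at time $t$, the model weighted feature $j$ by this much." You might discover that RSI dominates during ranging markets, while macro yield curve features dominate during regime transitions. This is not a post-hoc explanation -- it is baked into the architecture.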
There are separate VSNs for static covariates, past observed inputs, and known future inputs. This separation respects the causal structure of the forecasting problem: you cannot use future observations, but you can use known future events (earnings dates, FOMC meetings, option expirations, day-of-week effects).
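A stripped-down sketch of the selection step follows. It is deliberately simplified: the GRNs are replaced by single random linear or tanh layers and the static context is omitted (both are assumptions made for brevity), so it only illustrates the softmax weighting and the weighted sum, not the full network.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

m, d, T = 4, 8, 5                    # features, embedding size, time steps
xi = rng.normal(size=(T, m, d))      # per-feature embeddings at each time step

# Stand-ins for the GRNs (assumption: one random layer each, no static context).
W_sel = rng.normal(scale=0.3, size=(m * d, m))   # flattened features -> m scores
W_feat = rng.normal(scale=0.3, size=(m, d, d))   # one transform per feature

v = softmax(xi.reshape(T, m * d) @ W_sel)                   # (T, m) selection weights
xi_tilde = np.tanh(np.einsum("tmd,mde->tme", xi, W_feat))   # (T, m, d) processed features
selected = (v[:, :, None] * xi_tilde).sum(axis=1)           # (T, d) weighted sum

importance = v.mean(axis=0)   # average selection weight per feature over time
print(v.sum(axis=1))          # each row sums to 1
```

Because the weights at every time step sum to one, averaging them over time (or over a regime) gives a directly comparable importance share per feature.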
3. Static Covariate Encoders
In portfolio forecasting, static covariates represent entity-level metadata that does not change over time: asset class, sector, exchange, market cap bucket, or geographic region. TFT processes these through dedicated GRNs to produce four distinct context vectors:
- $\mathbf{c}_s$ -- context for temporal variable selection
- $\mathbf{c}_e$ -- context for static enrichment of temporal features (applied after the LSTM)
- $\mathbf{c}_c$ -- cell state initialization for the LSTM
- $\mathbf{c}_h$ -- hidden state initialization for the LSTM
This is how TFT handles cross-sectional information. When forecasting a portfolio of 500 stocks, the static encoders allow the model to learn that tech stocks and utility stocks have fundamentally different temporal dynamics, without needing separate models.
4. Temporal Processing: LSTM + Interpretable Multi-Head Attention
TFT uses a two-stage temporal processing pipeline that combines the strengths of recurrent and attention-based architectures.
Stage 1: Local Processing with LSTM
A sequence-to-sequence LSTM encoder-decoder processes the time series to capture local temporal patterns -- short-term momentum, mean-reversion, and autoregressive structure. The encoder processes the lookback window; the decoder processes known future inputs. Both are initialized with the static context vectors $\mathbf{c}_c$ (cell state) and $\mathbf{c}_h$ (hidden state).
Stage 2: Long-Range Dependencies with Interpretable Multi-Head Attention
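After local processing, TFT applies a modified self-attention mechanism to learn long-range dependencies. The key modification is in how attention heads are aggregated. Standard multi-head attention concatenates head outputs:
$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = [\mathbf{H}_1, \dots, \mathbf{H}_{m_H}]\,\mathbf{W}_H, \qquad \mathbf{H}_h = \text{Attention}\left(\mathbf{Q}\mathbf{W}_Q^{(h)}, \mathbf{K}\mathbf{W}_K^{(h)}, \mathbf{V}\mathbf{W}_V^{(h)}\right)$$
TFT instead shares values across heads and averages attention weights:
$$\text{InterpretableMultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \tilde{\mathbf{H}}\,\mathbf{W}_H, \qquad \tilde{\mathbf{H}} = \left\{\frac{1}{m_H}\sum_{h=1}^{m_H} A\left(\mathbf{Q}\mathbf{W}_Q^{(h)}, \mathbf{K}\mathbf{W}_K^{(h)}\right)\right\}\mathbf{V}\mathbf{W}_V$$
Notice: $\mathbf{W}_V$ has no head superscript -- all heads share the same value projection. This means the attention weights from each head can be meaningfully averaged and interpreted as temporal importance scores. For a given prediction, you can visualize exactly which past time steps the model attends to, and how attention patterns differ across heads.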
For portfolio forecasting, this reveals whether the model is attending to recent price action (momentum), distant historical patterns (seasonality), or specific calendar events. Each head can specialize in a different temporal pattern.
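The shared-value, averaged-weights idea can be sketched in NumPy as follows. This assumes per-head query and key projections, one shared value projection, and a causal mask; the final output projection is omitted and the weights are random, so treat it as an illustration of the aggregation only. The function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def interpretable_mha(x, Wq, Wk, Wv, causal=True):
    """x: (T, d). Wq, Wk: (H, d, d_k), one per head. Wv: (d, d_v), shared."""
    T = x.shape[0]
    d_k = Wq.shape[-1]
    q = np.einsum("td,hdk->htk", x, Wq)
    k = np.einsum("td,hdk->htk", x, Wk)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)       # (H, T, T)
    if causal:
        future = np.triu(np.ones((T, T), dtype=bool), k=1)  # positions after the query
        scores = np.where(future, -np.inf, scores)
    attn = softmax(scores)            # per-head attention weights
    attn_mean = attn.mean(axis=0)     # one (T, T) matrix, averaged over heads
    return attn_mean @ (x @ Wv), attn_mean

T, d, H, d_k = 6, 8, 4, 4
x = rng.normal(size=(T, d))
Wq = rng.normal(scale=0.3, size=(H, d, d_k))
Wk = rng.normal(scale=0.3, size=(H, d, d_k))
Wv = rng.normal(scale=0.3, size=(d, d))
out, attn_mean = interpretable_mha(x, Wq, Wk, Wv)
print(attn_mean.sum(axis=1))  # every row still sums to 1 after averaging
```

Since every head multiplies the same values, the averaged matrix is itself a valid attention distribution, which is what makes it readable as a per-lag importance score.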
5. Quantile Output Layer
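The final layer outputs predictions at multiple quantiles, optimized with the quantile loss:
$$\mathcal{L}(\Omega, \mathbf{W}) = \sum_{y_t \in \Omega}\sum_{q \in \mathcal{Q}}\sum_{\tau=1}^{\tau_{\max}} \frac{QL\left(y_t, \hat{y}(q, t-\tau, \tau), q\right)}{M\,\tau_{\max}}, \qquad QL(y, \hat{y}, q) = q\,(y - \hat{y})_+ + (1 - q)\,(\hat{y} - y)_+$$
where $\mathcal{Q}$ is the set of target quantiles (e.g., $\mathcal{Q} = \{0.1, 0.5, 0.9\}$), $\Omega$ is the training set of $M$ samples, and $(\cdot)_+ = \max(0, \cdot)$. This gives you not just a point forecast, but a quantile estimate of the predictive interval at each horizon. For portfolio risk management, the 10th and 90th percentile forecasts inform position sizing and stop-loss levels.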
One caveat worth stating up front: minimizing pinball loss yields conditional-quantile estimates, but it does not guarantee calibrated out-of-sample coverage. On non-stationary, regime-switching markets these intervals are frequently miscalibrated -- the nominal 80% interval may cover far fewer (or more) than 80% of realizations. Treat the quantiles as estimates to be validated empirically with coverage/reliability tests, and tighten them with conformal prediction if you need coverage guarantees. We return to this in the production section.
TFT vs LSTM vs Vanilla Transformer

Understanding where TFT sits relative to other architectures helps in deciding when to use it.
| Feature | LSTM | Vanilla Transformer | TFT |
|---|---|---|---|
| Long-range dependencies | Limited (vanishing gradients) | Strong (self-attention) | Strong (LSTM + attention) |
| Variable selection | None (manual feature engineering) | None | Built-in VSN |
| Interpretability | Opaque | Attention weights (noisy) | Structured attention + variable importance |
| Static covariates | Ad-hoc concatenation | Ad-hoc concatenation | Dedicated encoding pipeline |
| Multi-horizon output | Autoregressive (error accumulation) | Direct or autoregressive (decoding-dependent) | Direct (parallel) |
| Known future inputs | Awkward handling | No native distinction | Explicit input separation |
| Probabilistic output | Requires modification | Requires modification | Native quantile regression |
| Training stability | Moderate | Often unstable on small data | Stable (gating helps) |
When LSTM Still Wins
LSTMs remain competitive for very short-term forecasting (tick-level, sub-minute) where the lookback window is short and local autoregressive structure dominates. They are also simpler to deploy, with lower latency in inference. For a market-making bot that needs sub-millisecond predictions, an LSTM is still the practical choice.
When Vanilla Transformers Fall Short
Standard transformers applied naively to financial time series often overfit. They lack the inductive biases needed for temporal data -- no concept of static vs time-varying features, no variable selection, and the standard concatenated multi-head attention produces attention maps that are difficult to interpret meaningfully.
When TFT Excels
TFT is the strongest fit when you have: (1) multiple heterogeneous input types, (2) cross-sectional data (many assets), (3) a need for multi-horizon probabilistic forecasts, and (4) a requirement for model interpretability. Most institutional portfolio forecasting problems meet all four conditions.
Python Implementation with PyTorch Forecasting
Here is a practical implementation of TFT for multi-asset portfolio return forecasting using the pytorch-forecasting library.
Data Preparation
Note the deliberate use of genuine known-future inputs -- day_of_week, days_to_earnings, is_fomc, days_to_expiry. These are values we know in advance for the forecast window, and feeding them as known reals/categoricals is precisely the TFT capability the architecture is built around. We drop the raw time_idx from the known reals: add_relative_time_idx=True already injects a relative position index, and a raw monotonic counter mostly leaks trend.
import pandas as pd
import numpy as np
import lightning.pytorch as pl
from lightning.pytorch.callbacks import EarlyStopping, LearningRateMonitor
from lightning.pytorch.loggers import TensorBoardLogger
from pytorch_forecasting import (
    TimeSeriesDataSet,
    TemporalFusionTransformer,
    QuantileLoss,
    GroupNormalizer,
)

# Assumes `df` is already loaded: a long-format DataFrame with one row per
# (asset_id, time_idx), an integer time_idx, and string-typed categoricals.
max_encoder_length = 60  # 60 trading days lookback (~3 months)
max_prediction_length = 20  # 20 trading days ahead (~1 month)
training_cutoff = df["time_idx"].max() - max_prediction_length

training = TimeSeriesDataSet(
    df[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="log_return",
    group_ids=["asset_id"],
    min_encoder_length=max_encoder_length // 2,
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=["sector", "market_cap_bucket"],
    # Known-future inputs: values available for the whole forecast window.
    time_varying_known_categoricals=["day_of_week", "is_fomc"],
    time_varying_known_reals=["days_to_earnings", "days_to_expiry"],
    # Observed inputs: only available up to the forecast origin.
    time_varying_unknown_reals=[
        "log_return",
        "rsi_14",
        "macd",
        "bb_width",
        "pe_ratio",
        "earnings_yield",
        "vix",
        "yield_spread",
        "dxy",
    ],
    target_normalizer=GroupNormalizer(
        groups=["asset_id"],
        # Log returns are signed, so standardize per asset. The "softplus"
        # transform forces positive outputs and does not suit this target.
        transformation=None,
    ),
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
)

# Validation set: predict the last max_prediction_length steps of each series.
validation = TimeSeriesDataSet.from_dataset(
    training, df, predict=True, stop_randomization=True
)

batch_size = 64
train_dataloader = training.to_dataloader(
    train=True, batch_size=batch_size, num_workers=4
)
val_dataloader = validation.to_dataloader(
    train=False, batch_size=batch_size * 4, num_workers=4
)
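Columns such as days_to_earnings and days_to_expiry have to be built before the dataset is constructed. One possible way to derive them, using only the standard library, is sketched below. The helper name and the earnings dates are made up for illustration, and the count is in calendar days; a trading-day count would need an exchange calendar.

```python
from bisect import bisect_left
from datetime import date

def days_to_next_event(day, event_dates):
    """Calendar days from `day` to the next event on or after it.

    `event_dates` must be sorted ascending. Returns None when no
    future event is scheduled.
    """
    i = bisect_left(event_dates, day)
    if i == len(event_dates):
        return None
    return (event_dates[i] - day).days

# Hypothetical earnings calendar for one asset.
earnings = [date(2024, 1, 25), date(2024, 4, 25), date(2024, 7, 25)]
print(days_to_next_event(date(2024, 1, 20), earnings))  # 5
print(days_to_next_event(date(2024, 1, 25), earnings))  # 0
print(days_to_next_event(date(2024, 1, 26), earnings))  # 90
```

The key property is that the value depends only on a schedule published in advance, so it can legitimately be filled in for every step of the forecast window.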
Model Definition and Training
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=1e-3,
    hidden_size=64,
    attention_head_size=4,  # number of attention heads
    dropout=0.1,
    hidden_continuous_size=32,
    output_size=7,  # 7 quantiles, must match the loss below
    loss=QuantileLoss(quantiles=[0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98]),
    log_interval=10,
    optimizer="Ranger",
    reduce_on_plateau_patience=4,
)
print(f"Number of parameters: {tft.size() / 1e3:.1f}k")

# Stop once validation loss has not improved for 10 epochs.
early_stop_callback = EarlyStopping(
    monitor="val_loss",
    min_delta=1e-4,
    patience=10,
    verbose=False,
    mode="min",
)
lr_logger = LearningRateMonitor()
logger = TensorBoardLogger("lightning_logs")

trainer = pl.Trainer(
    max_epochs=100,
    accelerator="auto",
    enable_model_summary=True,
    gradient_clip_val=0.1,
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)
trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)
Extracting Interpretable Outputs
The interpretability call has one non-obvious requirement: interpret_output() consumes the raw output dict (with keys like encoder_variables, decoder_variables, static_variables, encoder_attention, ...), which predict() only produces under mode="raw". Compute it once, then reuse the same object for both variable importance and attention.
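best_model_path = trainer.checkpoint_callback.best_model_path
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)

# predict() returns one object; the raw network output lives in .output
# and the inputs (requested via return_x=True) in .x.
raw_predictions = best_tft.predict(val_dataloader, mode="raw", return_x=True)
x = raw_predictions.x

interpretation = best_tft.interpret_output(raw_predictions.output, reduction="sum")
best_tft.plot_interpretation(interpretation)

# Quantile forecasts: (n_samples, n_horizons, n_quantiles)
preds = raw_predictions.output["prediction"]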
From Forecasts to Portfolio Weights
def forecasts_to_portfolio_weights(
    predictions: dict,
    index: pd.DataFrame,
    risk_aversion: float = 2.0,
) -> pd.DataFrame:
    """
    Convert TFT multi-horizon quantile forecasts into long-only portfolio
    weights using a mean-variance-INSPIRED score (not a full optimization:
    there is no covariance matrix here, only a per-asset risk-adjusted score).
    Uses the median forecast as the expected return and (q90 - q10) as a
    proxy for forecast uncertainty.
    """
    # Shape (n_assets, n_horizons, n_quantiles). Move to CPU first so that
    # .numpy() also works after GPU inference.
    preds = predictions["prediction"].detach().cpu()

    # Quantile order: [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98]
    median_return = preds[:, :, 3].mean(dim=1).numpy()  # avg across horizons
    q90 = preds[:, :, 5].mean(dim=1).numpy()
    q10 = preds[:, :, 1].mean(dim=1).numpy()
    forecast_uncertainty = q90 - q10  # width of 80% PI

    scores = median_return / (risk_aversion * forecast_uncertainty + 1e-8)

    # `index` must have one row per prediction, in the same order as preds.
    result = index[["asset_id"]].copy()
    result["score"] = scores
    result["raw_weight"] = np.maximum(scores, 0)  # long-only
    total = result["raw_weight"].sum()
    result["weight"] = result["raw_weight"] / (total + 1e-8)
    return result[["asset_id", "weight", "score"]]
Interpretability in Practice: What TFT Reveals About Markets

The interpretability outputs from TFT are not academic curiosities -- they provide actionable intelligence for portfolio managers.
Variable Importance Across Regimes
When you examine variable selection weights over time, patterns emerge that align with market intuition:
- During the 2022 rate hiking cycle: yield curve features (yield_spread, fed_funds_rate) dominated variable importance, accounting for 35-40% of selection weight. Technical indicators dropped to under 10%.
- During the 2024 AI-driven rally: momentum features (rsi_14, macd) surged in importance. Sector-level static covariates also became significant as the rally was highly concentrated in technology.
- During range-bound periods: mean-reversion indicators (bb_width, relative_value) received the highest weights, while trend-following features were suppressed by the gating mechanism.
This is the GRN gating in action. When momentum features carry no signal, the GLU gates drive their contribution toward zero, and the model automatically shifts to alternative features. (These are illustrative patterns from exploratory runs, not benchmarked results -- verify them on your own data.)
Temporal Attention Patterns
The attention weight visualization reveals the model's effective memory structure:
- Short-term attention peaks: Strong attention at lags 1-5 reflects autocorrelation and short-term momentum/reversal effects.
- Weekly patterns: Attention spikes at lags 5, 10, 15 correspond to weekly calendar effects.
- Monthly patterns: A consistent attention peak around lag 21 (one trading month) captures monthly rebalancing flows and options expiration effects.
- Earnings cycle: For individual stocks, attention shows sharp spikes at ~63 and ~126 day lags, corresponding to quarterly earnings dates. Seeing these requires a lookback longer than the 60-day encoder window used in the example above.
These patterns are a useful sanity check that the model has learned meaningful temporal structure rather than fitting noise, although attention weights alone do not prove it.
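Reading peaks off an attention vector can be automated with a small helper. The sketch below assumes you already have one averaged attention weight per encoder step, ordered oldest to newest; that ordering, the helper name, and the synthetic weights are all assumptions for illustration.

```python
import numpy as np

def top_lags(attn, n=3):
    """Return the n (lag, weight) pairs with the largest attention weight.

    attn: 1-D weights ordered oldest -> newest, so the last entry is lag 1.
    """
    attn = np.asarray(attn, dtype=float)
    lags = np.arange(len(attn), 0, -1)      # oldest position has the largest lag
    order = np.argsort(attn)[::-1][:n]      # indices of the largest weights
    return [(int(lags[i]), float(attn[i])) for i in order]

# Synthetic attention profile with bumps at lags 1, 5 and 21.
w = np.full(30, 0.01)
w[-1], w[-5], w[-21] = 0.30, 0.20, 0.15
w = w / w.sum()
print([lag for lag, _ in top_lags(w)])  # [1, 5, 21]
```

Tracking these top lags across retrains is a cheap way to notice when the model's effective memory shifts.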
Real-World Performance and Applications
Benchmark Results
From the original paper and subsequent research, TFT demonstrates strong, well-documented performance:
- 7% lower P50 and 9% lower P90 quantile loss on average versus the next-best model across the paper's benchmark datasets (electricity, traffic, retail, volatility), and 3-26% improvement over the strongest competing baseline depending on the dataset. The specific competitor varies by dataset -- for example, DeepAR is among the weaker baselines on electricity and traffic but more competitive on retail and volatility -- so the headline number is the gain over the next-best model, not over any single fixed baseline.
- Consistently outperforms classical ARIMA, ETS, and Prophet baselines on multi-horizon metrics.
Research Directions in Finance
TFT has spawned an active line of finance-specific research. A few representative threads, with their actual reported metrics (cite and reproduce before relying on any of them):
- Adaptive / multi-scale TFT for crypto. Work such as Adaptive Temporal Fusion Transformers for Cryptocurrency Price Prediction (arXiv:2509.10542) reports improvements over fixed-length TFT and LSTM baselines on crypto price prediction. Treat the trading-profitability numbers as paper-specific and regime-dependent rather than universal -- we deliberately do not quote an absolute Sharpe delta here, because the figures floating around the literature often conflate prediction accuracy with realized Sharpe.
- Sharpe-aware objectives. The MDPI Sensors "multi-sensor TFT / adaptive Sharpe ratio" line of work optimizes toward a Sharpe-style objective directly; note that its headline result is roughly an 18% improvement in Sharpe-ratio prediction accuracy, which is a different quantity from an absolute Sharpe increase.
- Hybrid TFT-GNN models. Combining TFT with graph neural networks to capture inter-asset dependencies, with hybrids reported to outperform standalone TFT on cross-sectional stock prediction.
- Cross-modal temporal fusion. Integrating structured price data with unstructured data (news, earnings-call transcripts) through transformer fusion layers, extending the TFT paradigm to multimodal financial data.
These directions are real, but none of them is a turnkey edge. Reproduce the metrics on your own universe and costs before deploying.
Practical Considerations for Production
Data requirements: TFT needs substantial training data. Plan for at least 2-3 years of daily data per asset, or 6-12 months of hourly data. Cross-sectional data (many assets) helps considerably -- training on 500 stocks simultaneously is more effective than training 500 separate models.
Feature engineering: While TFT performs variable selection automatically, the quality of candidate features still matters. Include a diverse feature set spanning technical, fundamental, macro, and alternative data. Let the VSN determine what is useful.
Computational cost: TFT is more expensive to train than LSTM but comparable to vanilla transformers. On a single A100 GPU, training on 500 stocks with 3 years of daily data takes on the order of a few hours for 100 epochs. Inference is fast -- generating forecasts for the full universe takes well under a second.
Validate calibration, do not assume it. Before any quantile drives position sizing, check empirical coverage on out-of-sample data: does the nominal 80% interval actually contain ~80% of realized returns? On non-stationary markets it often will not. Run reliability diagrams per horizon, and consider wrapping the model in conformal prediction to restore coverage guarantees.
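A coverage check takes only a few lines. The sketch below uses synthetic standard-normal "returns" in place of real out-of-sample data, and the function name is hypothetical; the point is the per-horizon comparison of nominal and empirical coverage.

```python
import numpy as np

def coverage_by_horizon(y, lower, upper):
    """Fraction of realizations inside [lower, upper], one value per horizon.

    All inputs are shaped (n_samples, n_horizons).
    """
    inside = (y >= lower) & (y <= upper)
    return inside.mean(axis=0)

rng = np.random.default_rng(3)
y = rng.normal(size=(2000, 3))  # synthetic stand-in for realized returns

# A correctly sized 80% interval for N(0, 1) is roughly +/- 1.2816.
good = coverage_by_horizon(y, np.full_like(y, -1.2816), np.full_like(y, 1.2816))
# An overconfident model produces intervals that are too narrow.
narrow = coverage_by_horizon(y, np.full_like(y, -0.5), np.full_like(y, 0.5))
print(good.round(2), narrow.round(2))
```

On real forecasts, the lower and upper arrays would be the 0.1 and 0.9 quantile outputs, and a result far from 0.80 at any horizon is the signal to recalibrate.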
Regime adaptation: TFT does not explicitly model regime switches (unlike HMMs). However, the gating mechanism provides implicit regime adaptation. In practice, retraining monthly with an expanding window captures structural changes effectively.
Overfitting mitigation: Financial data is noisy and non-stationary. Use aggressive regularization: dropout of 0.1-0.3, gradient clipping at 0.1, early stopping with patience of 10-15 epochs. Walk-forward validation is essential -- never use random train/test splits on time series.
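A walk-forward split generator with an expanding training window might look like the sketch below. The function name and the `gap` parameter are illustrative choices: the gap drops the periods between train and test, which is one way to keep multi-step targets at the end of the training set from overlapping the test window.

```python
def walk_forward_splits(n_periods, initial_train, test_size, gap=0):
    """Yield (train_range, test_range) pairs with an expanding train window."""
    start = initial_train
    while start + gap + test_size <= n_periods:
        train = range(0, start)
        test = range(start + gap, start + gap + test_size)
        yield train, test
        start += test_size  # grow the training window by one test block

splits = list(walk_forward_splits(n_periods=100, initial_train=60, test_size=10, gap=5))
for train, test in splits:
    print(f"train 0..{train[-1]}  test {test[0]}..{test[-1]}")
# train 0..59  test 65..74
# train 0..69  test 75..84
# train 0..79  test 85..94
```

Every test block lies strictly after its training block, so no future information reaches the model during fitting.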
Putting It All Together: A TFT-Based Portfolio Pipeline

A production pipeline for TFT-based portfolio management looks like this:
- Data ingestion: Collect OHLCV, fundamental, macro, and alternative data daily. Compute features (200+ candidates), including a genuine known-future calendar (earnings, FOMC, expiries, day-of-week).
- Feature store: Maintain a normalized, time-indexed feature store. Handle missing data, corporate actions, and survivorship bias.
- Model training: Retrain TFT monthly using walk-forward methodology. Start from 3-5 years of history and let the training window expand with each retrain.
- Forecast generation: Daily inference producing quantile forecasts at horizons 1, 5, 10, and 20 days for the full asset universe.
- Calibration check: Track coverage of each quantile interval per horizon. Recalibrate (or apply conformal adjustment) when coverage drifts.
- Interpretability dashboard: Visualize variable importance and attention weights. Flag anomalies (e.g., sudden shifts in feature importance).
- Portfolio optimization: Convert forecasts to weights using mean-variance, Black-Litterman, or risk parity, with TFT quantiles providing the return and uncertainty inputs.
- Risk overlay: Use the 2nd and 98th percentile forecasts for tail risk assessment. Reduce position sizes when prediction intervals widen.
- Execution: Pass target weights to the execution engine. Monitor tracking error between target and realized portfolio.
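For the conformal adjustment mentioned in the calibration step, one common recipe is split conformalized quantile regression: measure on a held-out calibration set how far realizations fall outside the predicted interval, then widen (or shrink) the interval by the appropriate quantile of those misses. The sketch below assumes that recipe, uses synthetic data, and clips the rank to the sample size for simplicity. Its coverage guarantee relies on exchangeability, which financial series only approximate, so it should be re-fit on recent data.

```python
import math
import numpy as np

def cqr_offset(y_cal, lo_cal, hi_cal, alpha=0.2):
    """Offset to add on both sides of an interval, fit on calibration data."""
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)  # > 0 when y falls outside
    n = len(scores)
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)      # 1-based order statistic
    return float(np.sort(scores)[rank - 1])

rng = np.random.default_rng(4)
y_cal, y_test = rng.normal(size=1000), rng.normal(size=1000)
lo, hi = -0.5, 0.5  # deliberately too-narrow nominal 80% interval

offset = cqr_offset(y_cal, np.full(1000, lo), np.full(1000, hi), alpha=0.2)
before = np.mean((y_test >= lo) & (y_test <= hi))
after = np.mean((y_test >= lo - offset) & (y_test <= hi + offset))
print(f"offset={offset:.2f}  coverage before={before:.2f}  after={after:.2f}")
```

In a pipeline, the offset would be computed per horizon from the most recent out-of-sample window and applied to the 0.1 and 0.9 quantile forecasts before they reach position sizing.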
Conclusion
The Temporal Fusion Transformer represents a genuine advance for quantitative portfolio management. It is not just another deep learning model thrown at financial data -- it is an architecture designed from the ground up for the specific challenges of multi-horizon temporal forecasting: heterogeneous inputs, cross-sectional structure, probabilistic outputs, and the critical need for interpretability.
The variable selection networks tell you what the model is paying attention to. The attention weights tell you when it is looking. The gating mechanisms ensure it gracefully handles irrelevant features and regime changes. And the quantile outputs give you uncertainty estimates for risk management -- provided you validate their calibration rather than taking it on faith.
TFT is not a silver bullet. Financial markets remain adversarial environments where any predictive edge is temporary and regime-dependent. But TFT provides a principled framework for building forecasting systems that are both powerful and understandable -- and in production trading, understanding why your model is making a bet is just as important as the bet itself.
References
- Lim, B., Arik, S.O., Loeff, N., & Pfister, T. (2021). "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting." International Journal of Forecasting, 37(4), 1748-1764. arXiv:1912.09363
- Oreshkin, B.N., Carpov, D., Chapados, N., & Bengio, Y. (2020). "N-BEATS: Neural basis expansion analysis for interpretable time series forecasting." ICLR 2020.
- Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). "DeepAR: Probabilistic forecasting with autoregressive recurrent networks." International Journal of Forecasting, 36(3), 1181-1191.
- pytorch-forecasting documentation -- TFT implementation with PyTorch Lightning.
- google-research/tft -- Original TFT reference implementation.
Authors
Trading-systems engineer
Trading-systems engineer building bots since 2017: cross-exchange arbitrage (connected up to 30 venues), cointegration-based pairs arbitrage across spot and futures, scalping, news and sentiment-driven strategies, trend algorithms, and portfolio management and balancing algorithms. Also builds sub-millisecond order execution, big-data warehouses, backtesting engines, AI agents, and trading interfaces (incl. open-source profitmaker.cc). Stack: JS/TS, Python, Rust/Zig/Go, DevOps, backend, frontend, architecture.