What I got from "Financial Machine Learning" by Bryan Kelly and Dacheng Xiu (audio summary)
Hi everyone, I'm going to start sharing podcast-style summaries of things I read so you can have the best summary possible.
Here's what I learned.
Financial Machine Learning is a burgeoning field that leverages advanced computational techniques to analyze and predict financial market phenomena. At its core, it involves building sophisticated statistical models that can learn complex patterns from vast amounts of financial data to make better decisions in areas like asset pricing and investment management.
The increasing importance of financial machine learning stems from three fundamental characteristics of financial markets and research:
Prices are Predictions: The current price of an asset isn't just a static value; it's a reflection of investors' expectations about its future payoffs, adjusted for their preferences and uncertainties. This means that understanding and forecasting asset prices or their expected returns is fundamentally a prediction problem, a domain where machine learning excels. Expected returns, in particular, are crucial for making informed investment allocation decisions.
Information Sets are Large: Financial markets are influenced by an enormous and constantly flowing stream of information. This includes both time-series data, showing how prices change over time, and cross-sectional data, illustrating how prices differ across many assets at a single point in time. Traditional financial models often struggle to process this immense and diverse information simultaneously, whereas machine learning tools are designed precisely for integrating and learning from such large datasets. The sheer volume of potential predictors for returns is staggering, numbering in the hundreds for stock-level characteristics and dozens for macroeconomic indicators.
Functional Forms are Ambiguous: We don't always know the exact mathematical "formula" that links all this information to asset prices or expected returns. Traditional finance often relies on fitting data into predefined economic models, which may not capture the real-world complexities like non-linear relationships or hidden interactions between variables. Machine learning, with its flexible, often non-linear models like neural networks, kernel methods, and decision trees, can directly approximate these unknown relationships from the data, uncovering intricate patterns that simpler models might miss.
Machine Learning Versus Traditional Econometrics: Two Cultures
The sources highlight a distinction between two "cultures" in financial economics research, as conceptualized by Breiman (2001) for statistics:
The "Structural Model/Hypothesis Test" Culture: This traditional approach starts with a specific economic theory or model and then tests if the data aligns with that model. These models tend to be rigid, use a limited number of variables, and often perform poorly when predicting new data they haven't seen before (out-of-sample). Their conclusions are primarily about the model's mechanism, rather than necessarily nature's mechanism.
The "Prediction Model" Culture: This approach prioritizes a model's ability to accurately predict outcomes, even if the model itself doesn't have a direct, explicit link to an economic theory. It uses flexible models to identify robust patterns in data that lead to improved predictions, ultimately helping investors, consumers, and policymakers make better decisions. While traditional methods test specific economic ideas, machine learning helps "map out the empirical landscape," potentially inspiring new economic theories based on observed patterns.
Machine learning, in this context, is defined by three key elements:
A diverse collection of high-dimensional models for statistical prediction. Machine learning embraces models that are highly parameterized and often nonlinear, offering greater flexibility to describe complex real-world phenomena, even at the cost of interpretability or precise parameter estimates compared to simpler models.
"Regularization" methods for model selection and mitigation of overfitting. Overfitting occurs when a model learns the "noise" in the training data too well, leading to poor performance on new, unseen data. Regularization techniques constrain model size to encourage stable out-of-sample performance, ensuring that "An optimal model is a ‘Goldilocks’ model. It is large enough that it can reliably detect potentially complex predictive relationships in the data, but not so flexible that it is dominated by overfit and suffers out-of-sample". This includes techniques like Lasso and Ridge regression.
Efficient algorithms for searching among a vast number of potential model specifications. When datasets are large or models are heavily parameterized, computational efficiency becomes crucial. Machine learning employs approximate optimization routines, such as using data subsets or stopping searches before full convergence (e.g., stochastic gradient descent and early stopping), to reduce computational loads with minimal loss of accuracy.
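To make elements (2) and (3) concrete, here's a minimal sketch using scikit-learn on simulated data (the sizes, penalty values, and the number of truly useful signals are all my own assumptions, not something from the book): Lasso and Ridge illustrate regularization, and SGDRegressor illustrates approximate optimization via stochastic gradient descent with early stopping.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, SGDRegressor

rng = np.random.default_rng(0)
T, P = 500, 100                              # observations, candidate predictors (assumed)
X = rng.standard_normal((T, P))              # hypothetical signals
beta_true = np.zeros(P)
beta_true[:5] = 0.05                         # only a few signals actually matter
y = X @ beta_true + rng.standard_normal(T)   # low signal-to-noise "returns"

# Regularization: constrain model size to stabilize out-of-sample performance.
ridge = Ridge(alpha=10.0).fit(X, y)          # L2 penalty shrinks every coefficient toward zero
lasso = Lasso(alpha=0.01).fit(X, y)          # L1 penalty sets many coefficients exactly to zero
print("nonzero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))

# Approximate optimization: stochastic gradient descent with early stopping.
sgd = SGDRegressor(
    penalty="l2", alpha=1e-4,                # light ridge-style regularization
    early_stopping=True,                     # hold out part of the training sample...
    validation_fraction=0.2,                 # ...and stop once the validation score stalls
    n_iter_no_change=5, max_iter=1000,
).fit(X, y)
```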
The fundamental difference lies in their primary objective: traditional statistics often focuses on estimating a known data-generating model and conducting hypothesis tests, while machine learning prioritizes maximizing prediction accuracy in the face of an unknown data model.
This "prediction model" culture can help address what Hayek (1945) identified as the central economic problem: the dispersion of information in society, meaning no single mind possesses all relevant data to perfectly solve resource allocation problems. In a statistical sense, there's a "wedge" between optimal allocations achievable when the true data generating process (DGP) is known ("first-best") and realistic allocations based on estimated, possibly misspecified models ("third-best"). Machine learning, by digesting vast information and data, offers an opportunity to mitigate these wedges, leading to better decisions even if statistical limits to learnability mean the wedges never shrink to zero.
Challenges of Applying Machine Learning in Finance
Despite its advantages, applying machine learning in finance presents unique hurdles:
"Small Data" Problem: While machine learning often thrives on massive datasets in other fields, many fundamental financial questions, especially concerning economic time series data, are limited to relatively few historical observations (e.g., a few hundred monthly data points). New data only becomes available as time passes.
Low "Signal-to-Noise" Ratio: Financial data, particularly stock returns, is characterized by a very low amount of useful information ("signal") relative to random fluctuations ("noise"). Competitive markets constantly strive to eliminate predictable patterns, meaning most price changes are due to unexpected news, which is unforecastable noise. Expected return predictability is therefore small and highly contested.
Evolving Markets and Structural Instability: Financial markets are dynamic, with investors learning, and regulations and technologies changing. This creates a "moving target" for predictive models, as patterns might not persist indefinitely. This structural instability compounds the challenges of small data and low signal-to-noise ratios.
These challenges highlight an opportunity for economic theory to complement machine learning. Economic theory can provide structure, reducing the number of parameters a model needs to estimate and thereby helping it filter out noise more efficiently. It's a balance: oversimplified models can filter out signal along with the noise, so in data-rich or high signal-to-noise environments overly small models are suboptimal. Machine learning can capture aspects of the data where theory is silent, making the combination powerful.
The Virtues of Complex Models
Traditional econometrics, heavily influenced by the Box and Jenkins (1970) methodology, often emphasized the "principle of parsimony": using the smallest possible number of parameters for adequate representation. This view appears to clash with modern machine learning algorithms, which often involve massive parameterizations (e.g., GPT-3 with 175 billion parameters or financial neural networks with 30,000 parameters). Such rich parameterizations might seem prone to overfitting and poor out-of-sample performance to a traditional econometrician.
However, recent research, especially in fields like computer vision and natural language processing, contradicts this view, showing that models with astronomical parameterizations that perfectly fit training data often exhibit the best out-of-sample performance. This suggests that "bigger is often better". The search for a theoretical explanation for this success in "overparameterized" models (where parameters vastly outnumber observations) is ongoing, often discussed under concepts like "benign overfit" and "double descent".
Theoretical Understanding of Complex Models: Kelly et al. (2022a) explore this phenomenon in financial machine learning, focusing on the economic implications for utility-optimizing portfolios. They propose a framework involving:
Ridge Regression with Generated Features: They model asset returns as R_{t+1} = f(X_t) + ε_{t+1}, where f is an unknown true prediction function. They approximate f with a neural network, which can be couched as a high-dimensional linear prediction model using "generated features" (nonlinear transformations of raw predictors). The core idea is that even if the empirical model is misspecified (it's a finite approximation of an infinite series), flexible (large P) models can improve predictions by providing a more accurate approximation, despite potentially higher forecast variance. They use ridge-regularized least squares as the estimator, where regularization is critical to prevent the model from becoming singular when the number of parameters (P) exceeds the number of observations (T).
Random Matrix Theory: This mathematical tool is used to describe the behavior of ridge regression in settings where P is large relative to T (P/T → c > 0). It helps characterize the limiting distribution of the sample covariance matrix of signals, which in turn pins down the expected out-of-sample prediction performance (R2) and Sharpe ratio of associated trading strategies.
"Bigger Is Often Better" Explained: Their calibration results show that when model complexity (c = P/T) is low, simple models poorly approximate the true data-generating process, leading to low or zero R2. As P approaches T (the "interpolation boundary"), ordinary least squares (OLS) can lead to explosive forecast error variance due to severe overfitting. However, when P moves beyond T into the "overparameterized" regime, surprisingly, the ridgeless R2 (i.e., OLS with a very small, near-zero shrinkage parameter) begins to rise as complexity increases. This is because a larger solution space allows ridgeless regression to find "betas" with smaller norms, implicitly acting as a form of shrinkage that reduces forecast variance and improves R2. This directly challenges the traditional emphasis on model parsimony, demonstrating that increasing model dimensionality far beyond sample size can improve return forecast accuracy.
Economically, as model complexity (c) increases past 1 (P > T), trading strategy volatility continually decreases, and expected returns monotonically increase. This translates into a higher out-of-sample Sharpe ratio, as the misspecification bias from simple models is more costly than the shrinkage bias in highly complex models. This phenomenon is sometimes referred to as the "double ascent" of the Sharpe ratio, a mirror image of the "double descent" in mean squared error (MSE); with appropriate shrinkage, complexity is a virtue even in the low-complexity regime.
The overarching conclusion is that, in realistic settings with misspecified empirical models, complexity is beneficial not only for statistical performance but also for out-of-sample investor utility. This suggests including all plausibly relevant predictors and using rich nonlinear models, especially when accompanied by prudent shrinkage, even with scarce training data.
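Here's a toy sketch of the flavor of that experiment, under my own assumptions (a simple nonlinear data-generating process and random ReLU "generated features"); it fits minimum-norm ("ridgeless") least squares at several complexity levels c = P/T and reports out-of-sample R2. It is not the authors' calibration, just an illustration of the mechanics.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 200, 10                                   # training observations, raw predictors (assumed)

def make_panel(n):
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.standard_normal(n)   # unknown f plus noise
    return X, y

X_tr, y_tr = make_panel(T)
X_te, y_te = make_panel(5 * T)

for c in [0.1, 0.5, 1.5, 5.0, 20.0]:             # model complexity c = P/T
    P = int(c * T)
    W = rng.standard_normal((d, P)) / np.sqrt(d) # random feature weights
    Z_tr, Z_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)  # ReLU "generated features"
    beta = np.linalg.pinv(Z_tr) @ y_tr           # ridgeless (minimum-norm) least squares
    r2 = 1 - np.mean((y_te - Z_te @ beta) ** 2) / np.var(y_te)
    print(f"c = {c:5.1f}   out-of-sample R2 = {r2: .3f}")
```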
The Complexity Wedge: Didisheim et al. (2023) introduce the "complexity wedge," defined as the expected difference between in-sample and out-of-sample performance. This wedge has two components:
Overfit: Complexity inflates a trained model's in-sample predictability relative to the true model's predictability.
Limits to Learning: High complexity means the empirical model lacks sufficient data (relative to its parameterization) to fully recover the true model, leading to a shortfall in out-of-sample performance compared to the true model.
The complexity wedge implies that even if an "infeasible" true predictive R2 or Sharpe ratio is high, actual attainable performance for real-world investors is significantly attenuated due to the difficulty of accurately estimating complex statistical relationships with finite data. For example, attainable Sharpe ratios might be an order of magnitude lower than the true data-generating process.
Key Applications of Machine Learning in Finance
Machine learning is being applied across various crucial areas of finance:
1. Return Prediction
The goal of return prediction is to measure an asset's conditional expected excess return, E_t[R_{i,t+1}] = g*(z_{i,t}), where g* is the true, immutable function of predictor variables z_{i,t}. While this "universality" assumption might seem ambitious given market complexities, its empirical success shows that it provides a robust description of asset returns.
Data: Much of the literature uses a standardized monthly panel of US stock returns and stock-level signals derived from CRSP-Compustat data, with recent efforts to standardize predictor sets (e.g., 153 signals from Jensen et al., 2021).
Experimental Design: Model selection is central to machine learning and requires moving beyond in-sample fit to out-of-sample evaluation methods.
Information Criteria (AIC, BIC): These penalize training sample performance based on the number of parameters to select models likely to have the best out-of-sample prediction.
Cross-Validation: This more data-driven approach separates observations into training and "pseudo"-out-of-sample validation sets, simulating out-of-sample performance to select models. Time-ordered splits (e.g., fixed design or recursive design) are common to avoid information leakage backward in time.
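A minimal sketch of a time-ordered split, assuming scikit-learn and a made-up panel of monthly signals; TimeSeriesSplit keeps every validation fold strictly after its training fold, so no information leaks backward in time while tuning the penalty.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.standard_normal((600, 50))     # hypothetical monthly signals
y = rng.standard_normal(600)           # hypothetical excess returns

cv = TimeSeriesSplit(n_splits=5)       # expanding, time-ordered train/validation splits
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0, 100.0]}, cv=cv)
search.fit(X, y)
print("selected ridge penalty:", search.best_params_)
```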
Simple Linear Models: The linear panel model R_{i,t+1} = β′z_{i,t} + ε_{i,t+1} serves as a benchmark. Early work by Haugen and Baker (1996) and Lewellen (2015) using a relatively large number of signals demonstrated real-time out-of-sample performance for forecasting returns and building trading strategies, showing that linear models can achieve significant out-of-sample R2 (e.g., ~1% per month) and impressive Sharpe ratios for long-short strategies.
Penalized Linear Models: With hundreds or thousands of predictors, OLS can fail due to overfitting noise, so regularization is crucial. Elastic Net combines Ridge regression (an L2 penalty, shrinking coefficients towards zero) and Lasso (an L1 penalty, forcing some coefficients to exactly zero and thereby enabling variable selection). Gu et al. (2020b) showed that elastic net penalization reversed the disastrous out-of-sample performance of OLS with many predictors, achieving a positive R2 and a significant Sharpe ratio. Penalized methods are popular for their tractability; a toy sketch of the elastic net appears after this group of models.
Generalized Additive Models (GAM): These can incorporate nonlinear transformations of predictors while using linear estimation tools, but often require penalization because the transformations multiply the number of parameters. Freyberger et al. (2020) used group lasso to show the importance of nonlinearities and that less than half of commonly studied signals had independent predictive power.
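Here's a hedged sketch of the elastic net idea described above, on a purely synthetic stock-month panel (nothing like the actual Gu et al. (2020b) data); l1_ratio mixes the Lasso and Ridge penalties, and cross-validation picks the penalty strength.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
n_obs, n_signals = 5000, 150
Z = rng.standard_normal((n_obs, n_signals))        # stock-level characteristics (simulated)
b = np.zeros(n_signals)
b[:10] = 0.02                                      # assume only 10 signals carry information
r = Z @ b + rng.standard_normal(n_obs)             # noisy excess returns

model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(Z, r)
print("signals kept:", int(np.sum(model.coef_ != 0)), "of", n_signals)
```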
Dimension Reduction: This involves forming linear combinations of predictors to reduce noise and isolate signal, particularly useful when predictors are highly correlated.
Principal Components Regression (PCR): Combines predictors into a few linear combinations (factors) that preserve their covariance structure, then uses these components in a linear regression. Ludvigson and Ng (2007, 2010) used PCR to forecast aggregate market and Treasury bond returns, finding significant out-of-sample power.
Partial Least Squares (PLS): Directly exploits covariation between predictors and the forecast target, maximizing predictive correlation. Kelly and Pruitt (2015) found PLS resilient when predictors contain irrelevant dominant factors.
Scaled PCA and Supervised PCA: Designed for low signal-to-noise settings, these methods assign weights to variables based on their correlations with the prediction target (Scaled PCA) or select subsets of predictors where factors are strong (Supervised PCA) to enhance the signal-to-noise ratio and aid factor recovery.
Principal Portfolios Analysis (PPA): A dimension reduction approach that harnesses joint predictive information across many assets simultaneously by applying Singular Value Decomposition (SVD) to the cross-covariance matrix of all assets' future returns with all assets' signals (the "prediction matrix"). Leading singular vectors, "principal portfolios," are interpreted as optimal, most "timeable" portfolios, yielding high Sharpe ratios and providing insights into pricing errors.
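To illustrate the PCR and PLS items above, here's a small sketch on simulated data with a latent factor structure; PCR compresses the predictors without looking at the target, while PLS builds components that maximize predictive covariation with it. The sizes and factor counts are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
T, P, K = 400, 60, 3
F = rng.standard_normal((T, K))                                          # latent common factors
X = F @ rng.standard_normal((K, P)) + 0.5 * rng.standard_normal((T, P))  # correlated predictors
y = 0.3 * F[:, 0] + rng.standard_normal(T)                               # target loads on one factor

pcr = make_pipeline(PCA(n_components=K), LinearRegression()).fit(X, y)   # unsupervised compression, then regression
pls = PLSRegression(n_components=K).fit(X, y)                            # supervised: targets predictive covariation
```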
Decision Trees: These models partition data into homogenous groups based on feature interactions, allowing for multi-way interactions at lower computational cost than explicit linear models. Forecasts are typically simple averages within partitions.
Boosting (Gradient Boosted Regression Trees - GBRT): Recursively combines forecasts from many shallow trees, fitting subsequent trees to the residuals of previous ones, with shrinkage to prevent overfitting. Rossi and Timmermann (2015) and Rossi (2018) used boosted trees to forecast conditional covariances and aggregate stock returns, respectively, showing significant improvements due to their nonlinear flexibility.
Random Forest: Draws multiple bootstrap samples of data, fits a separate tree to each, and averages their forecasts, also randomizing predictor selection (dropout) to regularize. Moritz and Zimmermann (2016) used random forest for conditional portfolio sorts, showing strong return prediction power.
AP-trees: Used by Bryzgalova et al. (2020) to conduct portfolio sorts, differing from traditional tree models by not learning the tree structure itself but using pre-selected ordering of signals and median splits, and employing "pruning" based on a Sharpe ratio criterion with elastic net penalization.
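As a rough illustration of the tree-based items above (not the cited papers' setups), here's a sketch of gradient boosted trees and a random forest fit to a simulated panel whose returns depend on an interaction between two characteristics.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(5)
Z = rng.standard_normal((2000, 50))                                # simulated characteristics
r = 0.05 * np.tanh(Z[:, 0]) * Z[:, 1] + rng.standard_normal(2000)  # nonlinear interaction plus noise

gbrt = GradientBoostingRegressor(
    n_estimators=300, max_depth=2,   # many shallow trees, each fit to the previous residuals
    learning_rate=0.05,              # shrinkage to limit overfitting
).fit(Z, r)

forest = RandomForestRegressor(
    n_estimators=300,                # average over trees grown on bootstrap samples
    max_features="sqrt",             # randomize predictor selection ("dropout") at each split
).fit(Z, r)
```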
Vanilla Neural Networks: Popular for their ability to act as "universal approximators" for any smooth predictive function. They consist of an input layer, one or more "hidden layers" that nonlinearly transform predictors through "activation functions," and an output layer. "Deep" networks introduce multiple hidden layers, leading to highly parameterized models. Gu et al. (2020b) found neural networks (NNs) significantly outperformed linear models for stock-month return prediction, achieving higher R2 and much stronger Sharpe ratios for long-short decile spread portfolios, especially among small stocks. NNs also revealed the important role of nonlinearities and interactions between characteristics (e.g., size interacting with momentum).
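And here's a minimal sketch of a "vanilla" feed-forward network in the spirit of (but far smaller than) the models in Gu et al. (2020b); the architecture, penalty, and simulated data are my assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
Z = rng.standard_normal((5000, 100))                      # simulated stock characteristics
r = 0.05 * Z[:, 0] * Z[:, 1] + rng.standard_normal(5000)  # e.g., a size-momentum style interaction

nn = MLPRegressor(
    hidden_layer_sizes=(32, 16, 8),  # three hidden layers with nonlinear activations
    alpha=1e-3,                      # L2 weight penalty as regularization
    early_stopping=True,             # stop on a held-out validation split
    max_iter=500,
).fit(Z, r)
```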
Comparative Analyses: Studies consistently show that unregularized linear models perform poorly, while penalization and dimension reduction substantially improve linear model performance, and nonlinear models (especially neural networks) are generally the best performers overall in terms of out-of-sample predictive R2 and trading strategy performance. These analyses often evaluate models based on economic criteria like Sharpe ratio or information ratio, rather than solely R2, as R2 can be an unreliable diagnostic of economic value.
2. Alternative Data Analysis
Machine learning is crucial for extracting insights from non-traditional data sources.
Textual Data: Early work used dictionary-based sentiment scoring. More recent methods use supervised learning with "bag of words" representations. Ke et al. (2019)'s SESTM (Sentiment Extraction via Screening and Topic Modeling) provides a data-driven method for constructing sentiment dictionaries, showing potent return predictive power and outperforming commercial and dictionary-based methods. Recent advancements involve "large language models" (LLMs) like BERT and GPT. LLMs, trained on vast text datasets with billions of parameters, provide sophisticated text representations ("embeddings") that can capture contextual meaning, leading to improved stock return predictions.
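A toy sketch of the supervised bag-of-words idea (this is not SESTM or an LLM, just the simplest possible version on a made-up two-article corpus): count words, then regress subsequent returns on the counts to learn which words carry return-relevant sentiment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

articles = [
    "earnings beat expectations and guidance raised",
    "firm misses estimates and cuts its outlook",
]
next_day_returns = [0.02, -0.03]                   # hypothetical labels

counts = CountVectorizer().fit_transform(articles)          # "bag of words" count matrix
sentiment_model = Ridge(alpha=1.0).fit(counts, next_day_returns)
```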
Image Data: Machine learning, particularly Convolutional Neural Networks (CNNs), has excelled in image analysis. Jiang et al. (2022) applied CNNs to historical price data represented as images (e.g., OHLC charts) to predict future stock returns. CNNs automatically extract predictive features from pixel values through operations like convolution, activation, and pooling. Their CNN-based strategy outperformed known price trend strategies significantly, revealing previously unstudied patterns like the predictive power of a stock's latest close price relative to its recent high-low range. Images of real estate and artwork have also been used with CNNs to improve pricing models.
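Here's a hedged sketch of a tiny convolutional network over price-chart images, loosely in the spirit of Jiang et al. (2022); the image size, layer shapes, and random inputs are all assumptions, and the real models are considerably deeper.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution over pixel values
    nn.ReLU(),                                   # activation
    nn.MaxPool2d(2),                             # pooling
    nn.Flatten(),
    nn.Linear(8 * 16 * 32, 1),                   # map extracted features to a return forecast
)

images = torch.randn(64, 1, 32, 64)              # a batch of 32x64 grayscale OHLC-style charts
forecasts = cnn(images)                          # one forecast per image
```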
3. Risk-Return Tradeoffs (Factor Models)
This area focuses on explicitly modeling how expected returns are related to risk, often through factor models.
APT Foundations: The Arbitrage Pricing Theory (APT) of Ross (1976) provides a data-driven framework for factor pricing, suggesting that with a linear factor structure and no-arbitrage, asset pricing models can be learned by studying factor portfolios and distinguishing diversifiable from non-diversifiable risk.
Estimating Factors via PCA: Principal Component Analysis (PCA) is widely used when factors and betas are latent. While PCA was found to be unreliable for describing individual stock risk premia in earlier work, it shows greater success in modeling panels of portfolios.
Three-Pass Estimator of Risk Premia: Giglio and Xiu (2021) proposed this method, which marries Fama-MacBeth regression with PCA, to infer risk premia of non-tradable factors. It involves: 1) PCA to estimate latent factors and loadings; 2) Fama-MacBeth to recover risk premia of latent factors; and 3) another time-series regression to recover loadings of the non-tradable factor on estimated latent factors. This approach addresses issues like omitted variable bias and measurement error often present in conventional two-pass regressions.
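This is not the full three-pass estimator, but here's a rough sketch of its first two ingredients on simulated data: extract latent factors and loadings from a return panel by PCA/SVD, then run a cross-sectional (Fama-MacBeth style) regression of average returns on the estimated loadings to back out risk premia. The scalings and dimensions are assumptions, and the estimates are only identified up to rotation.

```python
import numpy as np

rng = np.random.default_rng(7)
T, N, K = 600, 100, 3
true_F = 0.02 * rng.standard_normal((T, K))                 # latent factor returns
true_B = rng.standard_normal((N, K))                        # true loadings
R = true_F @ true_B.T + 0.05 * rng.standard_normal((T, N))  # panel of asset returns

Rd = R - R.mean(axis=0)
U, S, Vt = np.linalg.svd(Rd, full_matrices=False)
factors = np.sqrt(T) * U[:, :K]                             # estimated latent factors (up to rotation)
loadings = Vt[:K].T * S[:K] / np.sqrt(T)                    # estimated betas

avg_ret = R.mean(axis=0)
risk_premia, *_ = np.linalg.lstsq(loadings, avg_ret, rcond=None)  # cross-sectional second pass
```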
Factor Selection: Machine learning methods like Lasso regression can help identify a parsimonious set of factors that price the cross-section of assets, addressing the challenge of hundreds of potentially redundant or useless factors.
Conditional Factor Models (IPCA): Given that asset covariances and means are often time-varying, Instrumented Principal Components Analysis (IPCA) by Kelly et al. (2020b) addresses the challenge of identifying and estimating conditional latent factor models. IPCA links asset betas (and alphas) to observable characteristics ("instruments"). It handles the migration of asset identities through time by parameterizing betas with characteristics that define a stock's risk and return, avoiding the need for ad hoc portfolio formation. IPCA can also estimate if characteristics proxy for alpha (mispricing) instead of just beta (risk exposure). Empirically, IPCA provides a more accurate description of assets' risk compensation than models with pre-specified observable factors, using significantly fewer parameters.
Complex Factor Models: Generalizations of IPCA allow betas to be nonlinear functions of characteristics, such as neural networks. Gu et al. (2020a)'s "conditional autoencoder" (CA) model is a deep learning approach that explicitly accounts for the risk-return tradeoff. Didisheim et al. (2023) theoretically prove the "virtue of complexity" in factor pricing, showing that adding more factors (i.e., using a richer representation of conditioning information) continually improves out-of-sample SDF (stochastic discount factor) performance, leading to decreasing expected out-of-sample alphas.
4. Optimal Portfolios
The portfolio choice problem lies at the heart of finance, aiming for efficient resource allocation.
"Plug-in" Portfolios: A common, but often suboptimal, approach is to estimate return distribution moments (mean and covariance) and then "plug" these estimates into the Markowitz optimal portfolio formula. This "plug-in" estimator can perform poorly, especially when the number of assets (N) approaches the number of observations (T), leading to substantial utility loss for investors. Its inadmissibility stems from not accounting for the impact of estimation uncertainty on end-use utility.
Integrated Estimation and Optimization: Machine learning offers solutions by integrating the statistical and economic objectives. Instead of separate estimation and utility maximization steps, the portfolio rule itself is chosen to optimize in-sample utility, with regularization to ensure stable out-of-sample performance (e.g., via cross-validation). This allows for flexible functional forms of the portfolio rule, ŵ = f(X_T), where X_T is all relevant data.
Maximum Sharpe Ratio Regression (MSRR): This approach formalizes portfolio optimization as a one-step procedure. It leverages the insight that the Markowitz optimal portfolio is proportional to the OLS coefficient in a regression of a constant vector on asset returns, effectively seeking the combination of assets with the highest in-sample Sharpe ratio. MSRR allows for parametric portfolio weights that vary with conditioning variables, often by expressing weights as w_t = S_t β, where S_t are signals and β are coefficients. This framework easily accommodates lasso or elastic-net regularization, enabling shrinkage and variable selection for portfolio design. MSRR is also adaptable to sophisticated machine learning models like neural networks, where the network transforms raw signals into optimal portfolio weights, leading to significant performance enhancements. Didisheim et al. (2023) prove that the "virtue of complexity" holds for mean-variance efficient portfolio construction, with Sharpe ratios increasing with model parameterization. (A toy sketch of MSRR follows below.)
SDF Estimation and Portfolio Choice: Many machine learning efforts in portfolio choice focus on Stochastic Discount Factor (SDF) estimation, leveraging the equivalence between portfolio efficiency and SDF-based Euler equations. The SDF can be represented as a tradable portfolio, and its estimation often involves minimizing squared pricing errors (e.g., as in MSRR). Regularization, like ridge shrinkage, is applied to factor-based SDFs, leading to better out-of-sample performance. Chen et al. (2021) extend SDF estimation to model weights on individual stocks using recurrent neural networks (LSTMs) to capture macroeconomic dynamics, generating instrumental variables in an adversarial manner: the procedure seeks an SDF that minimizes pricing errors while simultaneously finding instruments that highlight its worst possible performance. Didisheim et al. (2023) theoretically demonstrate that the SDF's expected out-of-sample Sharpe ratio improves with SDF model complexity when appropriate shrinkage is applied.
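Here's a hedged sketch of the MSRR mechanics mentioned above, on simulated data: form signal-managed portfolio returns F_{t+1} = S_t′R_{t+1}, then regress a vector of ones on F with no intercept (optionally with ridge shrinkage). The fitted coefficients are proportional to the in-sample tangency weights on those managed portfolios, and w_t = S_t β gives the stock-level weights. All dimensions and the data-generating process are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(9)
T, N, M = 480, 100, 20                            # months, stocks, signals (assumed)
S = rng.standard_normal((T, N, M))                # stock-level signals observed at t
R = 0.01 * S[..., 0] + 0.05 * rng.standard_normal((T, N))   # returns realized at t+1

F = np.einsum("tnm,tn->tm", S, R)                 # managed portfolio returns F_{t+1} = S_t' R_{t+1}
ones = np.ones(T)

msrr = Ridge(alpha=1.0, fit_intercept=False).fit(F, ones)   # regress a constant on F
beta = msrr.coef_                                 # signal weights; stock weights are w_t = S_t @ beta
strategy = F @ beta                               # realized strategy returns
print("annualized Sharpe:", strategy.mean() / strategy.std() * np.sqrt(12))
```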
Hansen-Jagannathan Distance: This distance metric provides a robust way to compare competing asset pricing models. It measures the pricing error of the most mispriced portfolio and the least squares distance between a model's SDF and the true SDF, making it an attractive objective function for training robust machine learning models of SDFs and optimal portfolios.
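In sample, the Hansen-Jagannathan distance is straightforward to compute; here's a sketch on simulated data, using the standard formula HJ = sqrt(g′ E[RR′]⁻¹ g) with pricing errors g = E[mR] − 1 for gross returns priced at one. The candidate SDF here is made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)
T, N = 600, 10
R = 1.005 + 0.04 * rng.standard_normal((T, N))    # hypothetical gross asset returns
m = 0.995 + 0.02 * rng.standard_normal(T)         # candidate SDF realizations

g = (m[:, None] * R).mean(axis=0) - 1.0           # sample pricing errors, E[mR] - 1
W = np.linalg.inv(R.T @ R / T)                    # inverse second-moment matrix of returns
hj_distance = np.sqrt(g @ W @ g)                  # pricing error of the most mispriced portfolio
print("HJ distance:", hj_distance)
```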
Trading Costs and Reinforcement Learning:
Trading Costs: Machine learning models often identify predictive patterns that are costly to trade in practice, raising questions about their real-world implementability. Incorporating known trading cost functions into the portfolio choice objective, as done by Jensen et al. (2022), allows the model to distinguish between implementable and non-implementable predictive patterns. Their solution involves learning an "aim" function, to which the investor gradually migrates her portfolio, using a neural network specification that bypasses the need for separate return prediction models for different investment horizons while maintaining economic consistency.
Reinforcement Learning (RL): While not widely applied to basic portfolio choice for price-takers (where supervised learning suffices), RL is valuable when an agent's decisions influence the state of the system, such as for large asset managers whose trading has price impact. RL allows the investor to learn how their actions affect market dynamics, fostering experimentation ("exploration") and optimization ("exploitation") of future rewards. The computer science literature has applied RL more prominently to high-frequency problems like market making and trade execution.
Conclusion
Financial machine learning is transforming our understanding of financial markets by offering indispensable tools for handling complex data and identifying patterns that traditional methods struggle with. While current research heavily focuses on prediction tasks, future directions include shedding light on underlying economic mechanisms, solving sophisticated nonlinear structural models, and adapting to structural changes in evolving markets. Though this survey primarily covers asset pricing, machine learning is also making significant inroads into other financial fields like corporate finance, entrepreneurship, household finance, and real estate.