Performance Evaluation: Judging Trading Strategies and Managers (Chapter 22, Part 1)
Here’s a question that keeps coming up in investing: how do you know if a fund manager is actually good, or just lucky?
Larry Harris tackles this head-on in Chapter 22, and honestly, the answers are kind of uncomfortable. Most of the methods people use to judge investment performance are way less reliable than they think. But let’s start from the beginning.
The Core Problem: Skill vs. Luck
Every portfolio’s performance depends on two things: the quality of its management, and a bunch of random factors that nobody could have predicted.
Good managers add value. Bad managers waste it. But macroeconomic shifts, industry disruptions, surprise events, and pure randomness all affect returns too. A restaurant chain stock might drop 35% because of an E. coli outbreak at one franchise. No amount of research could have predicted that.
So when you look at a portfolio that did well, you’re seeing some mix of good management and good luck. When you see one that did poorly, it could be bad management or bad luck. Separating the two is the fundamental challenge of performance evaluation.
And here’s what makes it worse: investment policies often force managers into or out of certain positions regardless of their skill. A portfolio that must stay fully invested in equities will rise in a bull market even if the manager is terrible. “A rising tide lifts all boats,” as the saying goes.
Absolute vs. Relative Performance
The simplest way to measure performance is absolute: how much did the portfolio value change? If your portfolio went from 100 to 120, you earned 20%. Done.
But that number is basically useless without context. An equity portfolio that drops 10% when the market drops 20% actually performed great. One that rises 15% when the market is up 30% performed terribly.
This is why analysts use relative performance. They compare portfolio returns against a benchmark, something that represents how the portfolio would have done without active management. For U.S. large-cap stocks, that’s usually the S&P 500.
Harris tells the story of the Beardstown Ladies, an investment club that wrote a best-selling book claiming they earned 23.4% annually over 10 years. When someone actually checked, their real return was 9.1%, which was way below the S&P 500’s 14.9%. They thought they were beating the market when they were actually getting crushed by it. They used the wrong frame of reference.
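To see how large that gap becomes once it compounds, here is the quick arithmetic, using the three rates from the story (Python used purely as a calculator):

```python
# Growth of $1 over 10 years, compounded at each annual rate.
claimed = (1 + 0.234) ** 10   # the club's claimed 23.4% per year
actual  = (1 + 0.091) ** 10   # their audited 9.1% per year
sp500   = (1 + 0.149) ** 10   # the S&P 500's 14.9% per year

print(round(claimed, 1), round(actual, 1), round(sp500, 1))
# roughly 8.2 vs 2.4 vs 4.0
```

The claimed record implies more than triple the wealth the club actually ended up with, and an index fund would have nearly doubled their real result.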
Market-Adjusted and Risk-Adjusted Returns
The most common relative measure is the market-adjusted return: your portfolio return minus the market index return. Simple and useful.
But you can go further with risk-adjusted returns. This is where beta comes in. Beta measures how strongly a security’s returns move with the market. A stock with a beta of 0.5 moves only half as much as the market, in either direction.
A manager who builds a low-beta portfolio will underperform in rising markets and outperform in falling ones. The opposite is true for high-beta portfolios. To account for this, analysts compute risk-adjusted excess returns (also called “realized alpha”) by subtracting the portfolio beta times the market return from the raw return. This helps determine whether a manager is actually picking good investments after you strip out their market exposure.
You can also break things down further with market-timing returns. This measures whether the manager is skillfully adjusting the portfolio’s beta to ride market swings. So raw return equals the market return plus the market-timing return plus the risk-adjusted excess return. It’s a clean decomposition.
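Under these definitions, and assuming a single fixed beta over the period (a simplification; real timing skill means beta varies over time), the decomposition can be sketched like this:

```python
def decompose_return(portfolio_return: float, market_return: float, beta: float):
    """Split a raw return into Harris's three components.

    With a fixed beta, the market-timing term reduces to (beta - 1) times
    the market return: the payoff from holding more or less market
    exposure than the benchmark.
    """
    timing = (beta - 1.0) * market_return            # market-timing return
    alpha = portfolio_return - beta * market_return  # risk-adjusted excess return
    # Sanity check: the three components sum back to the raw return
    assert abs(market_return + timing + alpha - portfolio_return) < 1e-12
    return market_return, timing, alpha

# A high-beta portfolio (beta 1.5) returning 18% in a 10% up market:
# market 10%, timing 5%, alpha 3% -- only the 3% reflects stock picking.
print(decompose_return(0.18, 0.10, 1.5))
```

Note how the raw 18% looks impressive, but 15 of those 18 points came just from market exposure.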
The Performance Prediction Problem
Here’s where things get really sobering. Most people evaluate past performance because they want to predict future performance. But for that to work, three conditions must all be true:
Past performance must reflect skill, not just luck. If someone got lucky, their past returns tell you nothing about the future.
The manager’s skills must remain effective. Market conditions change. Skills that worked in a bull market might be useless in a bear market. Regulation FD killed the advantage of managers who used to get inside information from corporate interviews. Their past performance became irrelevant overnight.
The manager must still have those skills. People age, lose motivation, lose key employees, or lose access to resources. A formerly skilled manager might not be skilled anymore.
If any one of these conditions fails, past performance is worthless as a predictor. And here’s the empirical reality: financial researchers have found essentially no correlation between which funds perform best in one year and which perform best the next. The worst performers do tend to stay at the bottom (they trade too much and charge high fees), but the top spots rotate essentially at random.
This result is robust across equity funds, bond funds, and commodity pools. It holds across years and across countries. It doesn’t depend on how you define “best.”
The Statistical Reality
Harris walks through the math, and it’s devastating. The t-test is the standard statistical tool for determining whether a manager can systematically beat the market. The problem is power: how likely is the test to identify a truly skilled manager?
Assume a skilled manager can beat the market by 2% per year on average (which is realistic given the competitive environment). With five years of monthly data, a test at 95% confidence has only a 15% chance of correctly identifying that skilled manager. Even with ten years of data, it’s only 23%.
To put that in perspective: if you want a test that correctly identifies an unskilled manager 75% of the time AND correctly identifies a skilled manager 75% of the time, you need 22 years of data. Twenty-two years.
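These power figures can be reproduced approximately with a normal approximation to the t-test. The 2% annual alpha comes from the text; the tracking error is my assumption (roughly 7% per year, a plausible figure for an active equity fund, since the book’s exact input isn’t quoted here), and it is what drives the result:

```python
from statistics import NormalDist  # standard library, no SciPy needed

def detection_power(annual_alpha: float, annual_te: float, years: int,
                    confidence: float = 0.95) -> float:
    """Chance a one-sided test on monthly returns flags a truly skilled manager.

    Normal approximation to the t-test: power = 1 - Phi(z_crit - sqrt(n) * delta),
    where delta is the monthly alpha divided by the monthly tracking error.
    """
    n = 12 * years                        # number of monthly observations
    monthly_alpha = annual_alpha / 12
    monthly_te = annual_te / 12 ** 0.5    # volatility scales with sqrt(time)
    delta = monthly_alpha / monthly_te
    z_crit = NormalDist().inv_cdf(confidence)
    return 1 - NormalDist().cdf(z_crit - delta * n ** 0.5)

print(round(detection_power(0.02, 0.07, 5), 2))   # five years of monthly data
print(round(detection_power(0.02, 0.07, 10), 2))  # ten years
```

With these assumed inputs, the sketch lands near the 15% and 23% figures in the text, and it makes the underlying mechanics visible: the signal (alpha) is tiny relative to the noise (tracking error), so the sample has to be enormous before the test can tell them apart.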
Even Warren Buffett’s numbers get complicated under this lens. Over 36 years, Berkshire Hathaway beat the market by 11.8% annually. But his performance dropped to 6.8% annually in the last 10 years of that period. Was the early performance skill, luck, or both? Peter Lynch showed a similar pattern at Fidelity Magellan, going from 12.7% outperformance overall to just 5.1% in his last five years.
The Bottom Line
Harris makes the statistical case for indexing more clearly than almost anyone else. When you run the optimal test to decide between an active manager and an index fund, the expected benefit of using the test is tiny. With ten years of data, it’s only 8.6 basis points. For most investors, it’s simply not worth trying to identify skilled managers from their track records.
Most professional managers would be “delighted beyond description” if they were certain they could beat the market by 2% per year. That’s how competitive the game is. And our brains are hardwired to believe that past performance predicts the future, because for our evolutionary ancestors, it usually did. But trading is not survival. Trading is a zero-sum game against other smart competitors.
In Part 2, we’ll look at the deeper problems with statistical evaluation, including the devastating sample selection bias and what actually does predict performance.
This post is part of a series retelling “Trading and Exchanges: Market Microstructure for Practitioners” by Larry Harris (Oxford University Press, 2003). The goal is to make these concepts accessible to everyone, not just finance professionals.