That single sentence launched one of the most famous experiments in psychology. In the 1970s Stanford marshmallow test, children who could wait for a second treat were said to grow into healthier, wealthier, happier adults. Willpower, we were told, is destiny. But is that really the whole story?
Then came the revisions. First the Rochester experiment: children who had just experienced an unreliable adult (someone who promised better crayons and never delivered) ate the marshmallow immediately. Children who met a reliable adult waited patiently. Same kids, same room, same treat—just different "priors" about whether promises get kept.
Later, larger studies showed that family income, stability, and education predicted waiting time far better than any innate trait. What looked like a universal law of self-control turned out to be a lesson in learned trust and contextual priors: “Given my past experience, what is the probability the adult actually returns with a second marshmallow?”
We live in a fog of predictions! Politicians promise economic miracles or imminent collapse, and policies change based on those predictions. CEOs unveil five-year plans backed by an 87-slide deck. Think-tanks release 400-page reports on climate, AI, or inequality, each claiming to have finally cracked the code. Are the claims true, or are they just serving someone's agenda?
Meanwhile, in our day-to-day, we still wrestle with the same ancient questions and our behaviour is driven by our predictions:
- “Will this person really show up on time?”
- “They’re late now, how much longer will they be?”
- “Is this investment safe? For how long?”
- “Will this rule actually help my child?”
How do we decide? Well, these are all marshmallow problems.
And we now have two extraordinarily sharp tools to solve them. As Brian Christian and Tom Griffiths argue in Chapters 6 and 7 of Algorithms to Live By, two remarkably simple tools let us see through the sleight of hand, or confirm that a forecast has a degree of validity: Bayes's Rule for updating beliefs, and a healthy suspicion of overfitting.
Tool 1 - Bayes's Rule
Ask the Questions
Bayes’s Rule is the mathematically precise way to update your beliefs when new evidence arrives.
In plain English it means that you start with what you already believe (the prior), weigh how well the new evidence fits that belief (the likelihood), and produce an updated belief (the posterior).
The formula looks intimidating but is simple once you see the pieces:
Posterior probability = (Likelihood × Prior) / Evidence
Prediction models are created to influence our decisions. The higher the "posterior probability", the stronger the case for the claim.
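To make the formula concrete, here is a minimal sketch of a single Bayesian update, applied to the marshmallow story. Every number in it (the prior, the two likelihoods) is an illustrative assumption, not a figure from the book or the studies.

```python
# A minimal sketch of one Bayesian update. All numbers are illustrative.

def bayes_update(prior: float, likelihood: float, evidence: float) -> float:
    """Posterior probability = (likelihood * prior) / evidence."""
    return likelihood * prior / evidence

# Prior: before any interaction, assume a 50% chance this adult keeps promises.
prior_reliable = 0.5

# Likelihoods: suppose a reliable adult delivers the promised crayons 90% of
# the time, and an unreliable one only 20% of the time.
p_deliver_if_reliable = 0.9
p_deliver_if_unreliable = 0.2

# Evidence: the overall probability of seeing the crayons actually arrive.
p_deliver = (p_deliver_if_reliable * prior_reliable
             + p_deliver_if_unreliable * (1 - prior_reliable))

# The crayons arrived -- update the belief that the adult is reliable.
posterior_reliable = bayes_update(prior_reliable, p_deliver_if_reliable, p_deliver)
print(round(posterior_reliable, 3))  # 0.818: one kept promise shifts belief a lot
```

A single piece of evidence moved the belief from 50% to roughly 82%, which is exactly the "updated belief" the formula promises.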
The fog comes from the fact that this update in belief only works if two choices, which most forecasts keep hidden, are correct:
- Your starting prior (what you believed before seeing the new data)
- The distribution that describes how the phenomenon actually behaves in the real world
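To see why the starting prior matters so much, here is a small hedged sketch echoing the Rochester experiment: two observers see the exact same evidence (one kept promise) but start from different priors and reach very different conclusions. The likelihood values are invented for illustration.

```python
# Same evidence, different priors, different conclusions. Illustrative numbers.

def posterior(prior: float, lik_if_true: float, lik_if_false: float) -> float:
    """Bayes's Rule with the evidence term expanded over both hypotheses."""
    evidence = lik_if_true * prior + lik_if_false * (1 - prior)
    return lik_if_true * prior / evidence

# Evidence: the adult kept one small promise. Assume a trustworthy adult keeps
# promises 90% of the time, an untrustworthy one 20% of the time.
trusting_child = posterior(prior=0.8, lik_if_true=0.9, lik_if_false=0.2)
burned_child = posterior(prior=0.1, lik_if_true=0.9, lik_if_false=0.2)

print(round(trusting_child, 2))  # 0.95 -- already trusted, now nearly certain
print(round(burned_child, 2))    # 0.33 -- one kept promise barely moves them
```

The arithmetic is identical in both cases; only the prior differs, and that alone decides whether waiting for the second marshmallow looks rational.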
Most people focus only on the data points, the "evidence". Influencers, honest or not, know the game is won or lost in the choice of distribution and the associated prediction rule. What do we mean by distributions? What are prediction rules?
Distribution: The underlying probability law that describes how the real-world phenomenon actually generates outcomes. It's the full description of all possible futures and how likely each one is. These are the curves you've likely seen in statistics graphs.
Prediction Rule: A shortcut formula that tells you, given what you’ve already observed, what you should expect next. It's the practical consequence derived from the distribution being used (or assumed).
Here are the three big families of distributions with the associated prediction rule that the book highlights (graphs below for each, taken from the book):
Normal distribution
(the classic bell curve)
Most outcomes cluster tightly around the average. Human height, IQ scores, measurement errors—all roughly normal.
Prediction rule: Average rule
Predict that the total will land near the distribution's mean. If the average wait is 15 minutes and you've already waited 10, expect about 5 more; the longer you've waited, the sooner the end should come. Things regress to the mean. For height, most people are about the same, extremes are rare, and the graph is a nice symmetric hill.
Power-law distribution
(extreme inequality, “winner-take-all,” fat tails)
A few giants dominate and rare events are orders of magnitude bigger than the rest. Think wealth, book sales, earthquake sizes, social-media, viruses, terrorist attacks, market crashes.
Prediction rule: Multiplicative rule
Multiply the time you’ve already observed by a constant to estimate how much longer it could still go. A 10-minute wait could easily become an hour—or five minutes. The tails never end.
Erlang distribution
(steady, effectively memoryless)
Events arrive one after another at a constant average rate with no clustering or long tails. Classic examples: radioactive decay, text messages, customer arrivals at a quiet bank counter, or (roughly) city buses on a reliable schedule.
Prediction rule: Additive rule
If you just got a text message, expect one very soon. But if you’ve already waited 10 minutes, expect another full 10 minutes on top. The process has no memory of how long it’s been waiting.
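The three rules above can be sketched as tiny functions, each returning the expected remaining wait given the time already elapsed. The specific constants (a 15-minute mean, a doubling multiplier, a 10-minute expected gap) are illustrative assumptions; doubling is the natural "expect it to last as long again" choice for a power law.

```python
# Hedged sketches of the three prediction rules. Constants are illustrative.

def average_rule(elapsed: float, mean_total: float) -> float:
    """Normal world: predict the total lands near the mean, so the
    expected remainder shrinks as the wait grows."""
    return max(mean_total - elapsed, 0.0)

def multiplicative_rule(elapsed: float, multiplier: float = 2.0) -> float:
    """Power-law world: predict the total as a constant multiple of what
    you have already seen; remaining = predicted total - elapsed."""
    return elapsed * multiplier - elapsed

def additive_rule(elapsed: float, expected_gap: float) -> float:
    """Erlang world: always expect the same amount more, no matter
    how long you have already waited."""
    return expected_gap

waited = 10.0  # minutes already waited
print(average_rule(waited, mean_total=15.0))     # 5.0  -> nearly done
print(multiplicative_rule(waited))               # 10.0 -> as long again
print(additive_rule(waited, expected_gap=10.0))  # 10.0 -> the clock resets
```

Note how the same 10-minute wait produces three different forecasts; the choice of distribution, not the data, drives the answer.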
As you can see, if you pick the wrong distribution, the prediction rule will leave you in the wrong ballpark. A central-bank economist who models inflation as a nice bell curve will be blindsided by power-law supply shocks. A public-health official who treats pandemic spread as Erlang will underestimate super-spreader events.

When someone hands you a confident forecast, two questions instantly expose whether it's rigorous or rigged:
- “What distribution are you assuming governs this process?”
- “Why is that the right one for this domain?”
Blank stares? That silence is gold: it tells you the model is built on habit, not evidence. Don't be too hasty to adopt the predictions presented.
Solid response? Beautiful. Now you have something to talk about!
Tool 2 - Overfitting
Clearing the Fog with Simpler Models
Let's start with the basics: What is a "model" anyway?
In everyday terms, a model is just a simplified map of reality—a way to make sense of data and patterns so we can guess what happens next. Good modeling finds the real signals (the important trends and variables) while ignoring the noise (random wiggles or one-off quirks that don't repeat).
Overfitting is what happens when your model gets too clever for its own good. Instead of sticking to the big-picture patterns, it starts chasing every tiny detail in the data you already have. It fits the past perfectly—like a puzzle snapped together—but when you try it on new situations, it falls apart because those details were just noise, not meaningful signals.
Why is it so common? Well, it certainly looks smart! Adding more factors (like extra variables or rules) always makes the model look better on old data. But that "better fit" often hides risks, like missing the forest for the trees.
Overfitting is everywhere—from personal decisions to expert forecasts—because we live in a world drowning in data, and it's easy to overcomplicate things without realizing it.
Everyday examples make this clearer:
- Imagine interviewing a job candidate. You notice their strong handshake, prestigious school, and clever phrasing in one answer. Your "model" (mental checklist) fits them perfectly as "the ideal hire." But on the job, they struggle—those quirks were noise, not predictors of success.
- Think of a trendy diet tested on 37 people for six weeks. It works wonders for them because the plan tunes into their specific habits, microbiomes, or lifestyles. But for everyone else? It flops—the model overfit those unique details instead of general truths.
- Or an election pollster's algorithm with 73 variables (age, income, weather on voting day, etc.). It nails the last three elections spot-on. Then 2020 hits, and it's way off—those extra factors captured temporary noise, not enduring patterns.
The risk? Overfit models create a false sense of clarity, leading to bad predictions and wasted efforts.
Spotting overfitting clears the fog. How? It reminds us to test models on fresh data, or to hold back some past points and see if the prediction still holds. This "out-of-sample" check reveals whether the model truly understands reality or just memorized the old stuff.
Your best defense is embracing simplicity, which often leads to stronger, more reliable predictions:
- Occam’s Razor: Go for the explanation with the fewest assumptions—fewer parts mean less room for noise.
- Regularization: Build in a penalty for adding extra variables. This encourages us to ignore minor details to keep the core strong.
- Early stopping: Stop tweaking when improvements on new data stall—don't chase perfection on the old.
- Heuristics: Quick rules like the recognition heuristic ("Does this remind me of past flops?") help spot overcomplicated promises fast.
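The regularization idea above can be sketched in a few lines: score each candidate model by its error on past data plus a penalty per variable. The penalty weight and the two hypothetical candidates (a 73-variable model that fits the past almost perfectly, and a 3-variable one that fits a bit worse) are invented for illustration.

```python
# Regularization in miniature: complexity has a price. All numbers are
# illustrative assumptions.

def penalized_score(fit_error: float, n_params: int, penalty: float = 1.0) -> float:
    """Lower is better: past-data error plus a cost per parameter."""
    return fit_error + penalty * n_params

# A hypothetical 73-variable pollster model vs. a 3-variable alternative.
complex_score = penalized_score(fit_error=0.1, n_params=73)
simple_score = penalized_score(fit_error=2.5, n_params=3)

print(complex_score)  # 73.1
print(simple_score)   # 5.5 -- the simpler model wins once complexity is priced in
```

Without the penalty, the 73-variable model looks strictly better (0.1 error vs. 2.5); with it, the ranking flips, which is the whole point of regularizing.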
As Christian and Griffiths conclude, keeping models simple and testing them rigorously helps us avoid the traps of overfitting and gives us tools to predict, and shape, the future more effectively.
Clarity in the Fog
Using the Tools Together
Put the two tools together and you become dangerous (in the best way):
- Start with explicit priors rooted in your own history—and write them down so you can update them.
- Demand the distribution. If they can’t name it, discount heavily.
- Ask: “How does this model perform on data it has never seen?” If the answer is “we didn’t test that,” you can walk away.
Predictions change. They should be continually revised, because each revision should bring better priors and simpler models. That’s not failure, it’s progress. But be clear: the priors and distributions must be sound for any prediction to be valid. Otherwise, we’re at the whim of any influencer with an agenda.
So the next time you’re handed a glossy prediction, whether it’s GDP growth, AI timelines, revenue growth, a future valuation, market adoption, or “just 5 more minutes”, run it through these two filters:
- your priors and
- the distribution.

