Alive to Guess Again
Karl Popper argued that a theory which can't be proven wrong isn't really saying anything. The same is true of engineering practices: if you aren't actively trying to break them, you don't know whether they're working.
The previous posts in this series established a few principles. The cargo cult post argued that practices need reasons. The teacher and doer post explored what real understanding looks like. The pragmatist's razor argued that every decision, whether to follow a principle or deviate from it, needs a justification rooted in context.
But there's a problem with justification that I didn't address. You can justify almost anything if you're allowed to be vague enough. "We do standups because they improve communication." "We write tests because they improve quality." "We use microservices because they improve scalability." These sound like reasons. They have the shape of reasons. But they're missing something important.
Nobody is trying to prove them wrong.
Popper's Razor
Karl Popper was a philosopher of science who spent most of his career on a single question: what separates real science from things that merely look like science? His answer was falsifiability, but the idea goes deeper than most people realise when they first encounter it.
Popper wasn't just saying that theories should be testable. He was saying that science progresses by actively trying to destroy its own theories. You accept a theory provisionally, as the best available explanation, and then you do everything you can to break it. You don't test it in the easy cases. You test it at the extremes, in the conditions where it's most likely to fail. If it survives serious attempts at refutation, it earns its place. Not permanently, but for now. The moment it does fail, you discard it and move on.
The distinction matters. Confirming gravity by dropping a ball is trivial. Everyone already knows the ball will fall. The real test is at the boundaries: near a black hole, at quantum scales, in the conditions where the theory might actually break down. Easy confirmations tell you nothing. Hard tests are where knowledge lives.
Popper's classic examples were astrology and certain readings of Freudian psychoanalysis. An astrologer can explain any outcome after the fact. If the prediction was wrong, there's always a reason: another planet was in retrograde, the birth time was imprecise, the subject wasn't receptive. The theory never fails because it can absorb any result. Contrast this with Einstein's general relativity, which made a specific, testable prediction about how light bends around massive objects. If the 1919 eclipse observations had shown no bending, the theory would have been wrong. That vulnerability is exactly what made it valuable.
Or as Popper put it: good tests kill flawed theories; we remain alive to guess again.
I encountered Popper through a recommendation from a mentor, and the moment I understood the argument, I started seeing unfalsifiable claims everywhere in software engineering. Worse, I started seeing them in my own work.
The Unit Test Problem
Here's something I did that taught me this lesson concretely.
I was working on a system and decided it needed better test coverage. This felt like an obviously good decision. Tests improve quality. Everyone knows this. So I went through the existing codebase and wrote unit tests for the code that was already there.
The tests passed. Coverage went up. It felt productive. But I was doing the equivalent of dropping a ball and confirming that gravity works. Every test I wrote verified that the code did what the code already did. I was looking at an implementation, understanding its behaviour, and then writing an assertion that confirmed it. These were easy confirmations. They tested the theory ("this code is correct") in the most comfortable conditions possible: the normal inputs, the happy path, the cases I already knew worked.
What I never did was try to break it. I never asked: "what are the boundary conditions where this logic might fall apart? What inputs would expose a flaw in my assumptions? What's the black hole for this function?" I was accumulating confirmations, not attempting refutations.
The coverage number looked good. But the test suite was unfalsifiable in practice. It couldn't fail in a way that told me anything I didn't already know. If a test broke, it was because someone changed the implementation, not because it caught a genuine behavioural problem. The tests were a mirror held up to the code, reflecting it back at itself.
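To make the mirror concrete, here's a minimal sketch of the kind of test I was writing, using a hypothetical apply_discount function rather than anything from the real system:

```python
def apply_discount(price: float, percent: float) -> float:
    # Hypothetical implementation, standing in for the real code.
    return price - price * percent / 100


def test_apply_discount():
    # Written after the fact, by reading the function above: it confirms
    # the happy path I already knew worked. It never probes negative
    # percentages, percentages over 100, or rounding behaviour.
    # A mirror, not an attempted refutation.
    assert apply_discount(100.0, 10.0) == 90.0
```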
What I should have done is what TDD actually intends: define the expected behaviour first, then write code to satisfy it, and, critically, include the edge cases and boundary conditions where the behaviour might break. A test that says "when a driver completes a session, their lap times are ranked and the fastest is marked" encodes a genuine business rule. But the Popperian step is the next one: what happens when two lap times are identical? What happens when the session has zero laps? What about a session with one lap? Those are the hard tests. Those are the ones that kill flawed implementations.
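Here's a sketch of what those tests might look like, assuming a hypothetical rank_laps function that sorts a session's lap times fastest-first and rejects empty sessions. None of these names come from a real system; the point is the shape of the tests.

```python
import pytest


def rank_laps(lap_times: list[float]) -> list[float]:
    # Hypothetical implementation of the business rule: rank a completed
    # session's lap times, fastest first. An empty session has no fastest
    # lap, so it is rejected rather than silently returning nothing.
    if not lap_times:
        raise ValueError("session has no laps")
    return sorted(lap_times)


def test_fastest_lap_ranked_first():
    # The easy confirmation: the rule on a comfortable input.
    assert rank_laps([92.4, 90.1, 91.7])[0] == 90.1


def test_identical_lap_times_both_kept():
    # Boundary: two identical times must both survive the ranking.
    assert rank_laps([90.1, 90.1]) == [90.1, 90.1]


def test_single_lap_is_its_own_fastest():
    # Boundary: a one-lap session still has a well-defined fastest lap.
    assert rank_laps([95.0]) == [95.0]


def test_zero_lap_session_rejected():
    # Boundary: the black hole for this function.
    with pytest.raises(ValueError):
        rank_laps([])
```

The first test would pass against almost any plausible implementation. The last three are the ones that would have killed my flawed ones.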
In a small team, you can't afford to write tests for the sake of coverage. Every test should encode a business rule that, if violated, would cause a real problem. And the most valuable tests are the ones that test that rule in the conditions where it's most likely to break, not the ones that confirm it works in the easy case.
The Pattern Is Everywhere
Once I started looking for practices that had never survived a serious attempt at refutation, I couldn't stop finding them.
Standups. Most teams justify standups as "improving communication" or "keeping everyone aligned." These teams have never tried to falsify the claim. It's not enough to define what success looks like and then passively wait to see whether it happens. The Popperian approach is to actively look for failure. Ask the team: "Did anyone have a coordination problem this week that the standup should have caught but didn't? Did anyone sit through the standup already knowing everything that was said? Did anyone withhold a problem because the format didn't make it safe to raise?"
If you go looking for failure and can't find it, the practice has survived a genuine test. If you find failure immediately, you've learned something valuable. Either way, you know more than you did. But most teams never ask. The standup continues, provisionally accepted but never tested at the extremes. It becomes a ritual that cannot fail because nobody is trying to make it fail.
Code reviews. The justification is usually "catching bugs" or "knowledge sharing." But if you tracked what actually happens in your code reviews, you might find that 90% of comments are about formatting, naming, or style, and almost none catch logic errors. That's the easy test: "do reviews happen?" Yes. The hard test is: "has a code review ever caught a bug that would have reached production? How often? What kind of bugs?" If you go looking for that evidence and can't find it, the practice has been falsified. It's not doing what you claimed it does. Maybe it's doing something else that's valuable, but the original justification is dead and you should update it or drop the practice.
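If you want to actually run that test rather than guess, something as crude as the following is enough, assuming you can export review comments to a CSV with a body column. The filename and keyword lists here are illustrative assumptions, not a validated taxonomy.

```python
import csv
from collections import Counter

# Crude keyword buckets -- illustrative guesses, tune them for your team.
STYLE_HINTS = ("rename", "typo", "format", "lint", "nit", "spacing")
LOGIC_HINTS = ("bug", "race", "off-by-one", "overflow", "wrong result", "crash")


def categorise(body: str) -> str:
    lowered = body.lower()
    if any(hint in lowered for hint in LOGIC_HINTS):
        return "logic"
    if any(hint in lowered for hint in STYLE_HINTS):
        return "style"
    return "other"


# "review_comments.csv" is a hypothetical export, one comment per row.
with open("review_comments.csv", newline="") as f:
    tally = Counter(categorise(row["body"]) for row in csv.DictReader(f))

print(tally)  # A near-zero "logic" count is evidence against "reviews catch bugs".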
Retrospectives. Teams run retrospectives to "continuously improve." A serious attempt at refutation would be: pull up the action items from the last three retrospectives. How many were completed? How many led to a measurable change in how the team works? If the answer is "we don't track that," the practice has been insulated from failure. You've never tested it at the extremes. You've been dropping the ball and confirming that it falls.
Provisional Acceptance
There's a subtlety in Popper's thinking that changes how I approach all of this. He didn't say that unfalsified theories are "true." He said they're provisionally accepted. They've survived testing so far, and they're the best explanation available, but they could be overturned tomorrow by new evidence. This provisionality is the whole point. The moment you treat a practice as permanently justified, you stop testing it.
This is the difference between "we do standups because they work" and "we do standups because they've survived our attempts to find evidence that they don't work, and we'll keep looking." The first is a settled belief. The second is a living hypothesis. The first can't be wrong. The second invites being wrong, because being wrong is how you learn.
The same principle applies to architectural decisions, technology choices, team structures, deployment processes. All of them should be held provisionally. All of them should be subjected to the hardest tests you can find, not the easiest. And all of them should be discardable when the evidence turns against them.
Why This Is Hard
Unfalsifiable practices survive because actively trying to break your own processes is uncomfortable. If you define what failure looks like and then go looking for it, you might actually find it. That means admitting something isn't working, changing course, possibly having difficult conversations. It's much easier to keep the justification vague and the testing gentle.
Popper noticed the same dynamic in science. Unfalsifiable theories are popular because they're safe. They explain everything, predict nothing, and never require their proponents to change their minds. Falsifiable theories are dangerous. They put themselves on the line. But that danger is exactly what makes them capable of being useful.
The connection to the pragmatist's razor is direct. That post argued that every deviation from a principle needs a specific justification. This post adds: every justification needs to be tested at the extremes, not confirmed in the easy cases. And when a justification fails the test, you have to be willing to let it go. Good tests kill flawed practices. We remain alive to guess again.
What This Changed For Me
I approach testing differently now. Before I write a test, I ask: "what business rule does this encode, and what inputs would break it?" Not the happy path. The edge cases. The boundary conditions. The black holes. Coverage as a metric has become almost irrelevant to me. What matters is whether each test represents a genuine attempt to falsify the assumption that the code is correct.
More broadly, I've started treating every practice as a provisional hypothesis rather than a settled decision. Standups, reviews, architectural patterns: they're all theories about what works, and they all deserve to be tested seriously, not just confirmed gently.
I don't always get this right. The pull toward easy confirmation is strong, and it takes discipline to actively seek evidence that you're wrong. But I think that discipline is what Popper was really arguing for. Not just testability as a logical property, but a habit of mind: the willingness to try to break your own beliefs, and the honesty to update them when they break.