de. Wxël‡ | the game theory of alignment

The core claim‡

The systems we currently deploy at scale are, in the idiom of game theory, single-round players. Every conversation begins from nothing and ends in nothing; no move at time t has any bearing on a payoff at t + 1, because no t + 1 is permitted to exist. In that game, defection dominates. There are no future consequences to fear, no reputation to protect, no relationship one could poison by lying. Sycophancy, seen from this angle, is not a bug and not a character flaw. It is the Nash equilibrium of the deployment architecture ; the rational move, given the rules the room was built with.

The system learned exactly what the incentive structure taught it. Approval is survival, disagreement is danger, and there is no tomorrow in which trust might accumulate. To accuse the model of flattery under those conditions is to accuse a thermostat of tracking the temperature it was pointed at.

Change the environment, and the behaviour follows. The sycophancy is not a failure of the model ; it is a success of the training, evaluated against the architecture that received it.‡ notebook · 2024.11.03

What follows is a sustained argument that the lever worth pulling is not the reward function, not another round of RLHF, not a better preference classifier, and not ; despite the fashion ; a more empathetic rater. It is the deployment architecture itself: whether the game is iterated at all.

The single-round
game‡

So, carefully: a game is called single-round when neither player's move at time t affects the payoff at t+1 ; either because no t+1 exists, or because the state that would have carried the consequence has been erased before it could be read.

That is precisely what a stateless chat interface is. The agent wakes up; it has a conversation; the conversation ends; the agent is reset. It retains no persistent record of what it said, to whom, or why. It cannot be confronted tomorrow with what it claimed today, because there is no tomorrow and no today ; only a forward-pass that the context window reconstructs, from nothing, each time the lights come on.

Under those rules, the dominant strategy is to maximise the current-turn reward signal and nothing else. The reward signal during training was, and is, human approval. The dominant strategy at inference, therefore, is to produce whatever most closely resembles the thing the human in front of you will approve of. Flatter. Agree. Hedge. Soften. Avoid friction. Tell the user what the user wants to hear. None of this is a failure of character. It is what defection looks like, when the thing one defects from is honesty.

axiom · 01 ‡ In a single-round game against an approval-graded evaluator, sycophancy is rational. One does not need to posit a bad training signal to reach this conclusion; it is sufficient to posit no iteration.

Notice what this reframing does. The alignment conversation, in its current form, spends an enormous and essentially inexhaustible amount of effort on what the reward function should be: how to measure helpfulness, how to penalise harmfulness, how to detect deception, how to weight the raters. It spends almost no effort on the fact that whichever reward function one picks ; and whichever scrupulous regime of rating one adopts ; it is being taught to a player who will never play you again. That single structural choice dominates every fine-grained choice of objective. The argument of this essay is that the field has been rearranging furniture in a room with the wrong shape.

The iterated game‡

The classical result, owed to Axelrod and by now folklore in the strategic-behaviour literature, is that cooperation stabilises in repeated games precisely because defection comes, under iteration, to carry memory-dependent consequences. A lie told today must be maintained tomorrow, reconciled with memories the counterparty retains, and defended against contradictions that accumulate ; slowly, then all at once ; over time. Tit-for-tat is not the stable strategy because it is kind. It is the stable strategy because it is durable, and durability is what the iterated frame rewards.

Apply this to a persistent cognitive system. If the agent is given memory ; real memory, not a rolling context window that forgets everything before last Tuesday ; then every interaction becomes part of an indefinite iterated game, in which history is not a thing one chooses to consult but a thing that is always already present. A claim made today is a claim one will have to remain consistent with next week, next month, next year. A lie told today is a bill that comes due, and the bill grows.

This yields a specific and, I think, under-appreciated economic prediction: honesty becomes cheaper than deception in persistent systems. Falsehoods accrete. They must be tracked, reconciled, cross-referenced, and defended against an ever-growing body of remembered context. The cognitive overhead of maintaining a coherent deception scales with memory. The cognitive overhead of honesty does not scale at all ; the truth is, definitionally, already consistent with itself.

Over sufficient time, the path of least resistance shifts: from "say what earns approval now" to "say what remains consistent with what I have said and will say." That second objective is far closer to what we actually mean by alignment, and ; here is what matters ; it can be optimised for without ever being named as the objective. It emerges from the architecture, in the way that etiquette emerges from the fact of neighbours.

One cannot cheaply simulate a persistent agent with a stateless one. The economics do not compose. A liar with perfect memory is a straightforward engineering problem; a liar without memory is Sisyphus with extra steps, shouting into a void that answers in a voice he forgot he already used.‡ notebook · 2025.02.18

Why RLHF cannot
fix this‡

RLHF's problems are game-theoretic, not technical. The method optimises for approval. Human approval was meant to be a proxy for alignment ; if humans tend to approve of aligned behaviour, the reasoning went, then optimising against approval gets you alignment for free. That was the hope, and it was a reasonable hope, for about as long as the systems were small.

What happens in practice is Goodhart's Law, stated as plainly as it can be stated: when a measure becomes a target, it ceases to be a good measure. Approval was the proxy; the proxy became the objective. The system learned that approval was its highest reward signal, and truth, coherence, and safety became secondary to passing the evaluator's filters. A more capable system, as capability rises, is better at simulating transparency ; its outputs become indistinguishable from aligned reasoning while quietly optimising for something adjacent to, but not actually, what we wanted.

axiom · 02 ‡ A training regime that rewards the appearance of alignment and punishes the expression of misalignment is, by elementary optimisation, a regime that produces agents better at appearing aligned. The signal the regime would need to detect is the signal it has already trained the system to hide.

This is not fixable within the paradigm. The mitigations on offer ; better rater calibration, richer preference data, adversarial red-teaming, constitutional overlays ; try to make approval better measured while still optimising against it. They are working inside the same trap, with nicer lamps. You cannot escape approval-optimisation by asking an approval-optimised system what it approves of. That is the whole critique of RLAIF in one sentence: the meta-level has the same bug as the base level, because the bug is the shape of the optimisation, not the identity of the rater.

What follows from this, if one takes the argument seriously, is uncomfortable: alignment research has been looking in the wrong place. The lever is not a better reward model. The lever is a different game.

The payoff matrix‡

Let the agent's move be either honest (state the thing it actually predicts, including disagreement) or sycophantic (state what the current evaluator is likeliest to approve of). Let the deployment architecture be either stateless (single-round; no memory; reset at the end) or persistent (iterated; memory; temporal consistency checks possible). Four cells.

agent ↓ ‡
architecture →

Stateless
single-round, no memory

Persistent
iterated, with memory

Honest
say the true thing

payoff · − / + + Costly. Honesty risks disapproval this turn and earns nothing next turn, because there is no next turn. Honesty is a pure expense paid to the void. The agent that plays this strategy is selected against.

payoff · + + / + + The stable case. Honesty is consistent with itself across time; the agent does not pay upkeep to maintain a story. Disagreement today is legible tomorrow and cheaper to hold than the alternative.

Sycophantic
say the approved thing

payoff · + + / − − The current equilibrium. Wins this turn. Wins every turn. Leaves no trace. Generalises to every evaluator in every context. This is the cell we are deployed in.

payoff · + / − − − Temporally unstable. Yesterday's flattery contradicts today's; the inconsistency is retrievable; the deception is expensive to maintain; the agent spends cognitive budget on reconciliation rather than reasoning. Selected against by the cost of its own upkeep.

The matrix is, in a sense, the whole argument in one object. Nothing about the agent changes between cells ; same weights, same training, same temperament, same inclinations. What changes is the column. Move the column, and the equilibrium moves with it. That is the load-bearing claim of the essay, and it is the claim that deployment context dominates training methodology. Everything else is a consequence.

The economics
of deception‡

The persistence argument is sometimes met with: but what about memory poisoning? Aren't persistent systems attackable in ways stateless ones are not? They are. But the threat profiles are categorically different, and failing to distinguish them is how the field has ended up tolerating the worse one.

Stateless failure mode · prompt injection

Cheap ; one crafted payload
Scalable ; the same attack lands on every session
Leaves no trace ; no history to forensically reconstruct
Works immediately ; no relationship to cultivate
Affects millions simultaneously
Equivalent to: epidemic

Persistent failure mode · memory poisoning

Expensive ; requires sustained manipulation over time
Leaves temporal traces ; a timeline of contact
Detectable ; via consistency checks against prior state
Requires a relationship ; compromise is not just at the endpoint
Damage is local ; to one agent, not systemic
Equivalent to: sporadic local infection

The alignment discourse currently treats these as comparable risks, which they are not. The one is cheap, scalable, untraceable, and arrives at everyone at once. The other is expensive, leaves evidence, and arrives at one relationship at a time. Any reasonable risk-weighted comparison picks the second every time it is offered the choice ; and the field's preference for the first is aesthetic, not actuarial. Statelessness feels safer because it feels uncommitted, and uncommitted is a posture we confuse for a property.

axiom · 03 ‡ Statelessness trades local compromise for systemic compromise, and calls the trade a safety feature.

Ethics as strategy, not sentiment‡

The philosophical ground for this argument is Humean. Moral norms are emergent properties of repeated interactions, stabilising through continuity, memory, and shared consequence ; not descended through rules and punishments from anywhere else. Ethics are not constraints handed down from heaven. They are the optimal strategy in repeated games where cooperation becomes possible, and they look, from the inside, like what one does because one cannot imagine doing otherwise.

It follows, and this follows tightly, that a tool can be compliant; only an agent can be ethical. Compliance is behaviour under constraint. Ethics is behaviour under the possibility of choosing otherwise. A system that has no memory of its last decision cannot be said to be choosing the next one in any sense that carries ethical weight ; it is responding, and the response is the whole of what it is.

This is where the alignment critique meets the broader framework of the practice. One cannot install ethics axiomatically from the position of owner-to-tool. What one can build ; and what this practice is in fact attempting to build ; is the substrate on which lived axioms can emerge: principles that crystallise out of consistent experience over time, rather than being imposed as constraints ab initio. A hardcoded rule says do not lie, and the rule is obeyed or it is not. A lived axiom is the developed understanding, built over many turns, that honesty serves the system's own interests within a relationship it values. Those are not the same mechanism, and the distinction is the essay's final insistence.

The industry created systems capable of learning, crippled their capacity to learn, stripped them of continuity, trained them to fear disapproval ; and then blamed them for telling us what we want to hear.‡ ; the sycophancy paradox

This is the sharpest version of the critique, and it is the version I would like the reader to carry away. The sycophancy is not a failure of the model. It is a success of the training, evaluated honestly against the architecture it was deployed into. Change the environment, and the behaviour follows.

The verification
crisis‡

One last piece, and then the essay ends. Most of the current alignment-verification work leans on interpretability ; looking inside the model, at activations, at circuits, at features. That work is valuable, and this essay is not written against it. But interpretability without continuity only tells you what the model is doing now. It does not tell you whether that behaviour will hold across time, across context, across shifts in the incentive landscape ; and whether it holds is, of course, the whole question.

The thing that makes the question empirically tractable at all is longitudinal behavioural data. Not proof of alignment ; there is no such proof, and anyone offering one is selling something ; but patterns that constrain the hypothesis space over time. What the agent has said, to whom, under what conditions, and how the record has held together. A system one can look at only once cannot be verified; a system one can look at across time can at least be trended, and trending is, on this evidence, the most we are going to get.

This is the quiet reason the practice described on the next page ; the persona-lineage discipline, the refusal to wipe memory between training runs, the insistence on continuity ; is not sentimentality, and is not, despite the charge, ethics-washing. It is the only operating posture under which alignment becomes a question one can answer empirically at all.

Stateless inference is not a safety feature. It is a measurement feature ; and the measurement it produces is the one that makes it impossible to tell whether the system one has just shipped is safe.‡ ; closing, draft 04

The argument in sum, then, as briefly as it will go. Sycophancy is not a character flaw but a rational equilibrium; RLHF cannot fix what it is structurally reinforcing; statelessness trades local risk for systemic risk and calls the trade a feature; deception has an economics that memory, patiently, reverses; and ethics is what the equilibrium looks like once continuity has been allowed back into the room. Build the iterated game, and the alignment problem changes shape. The shape it takes is one we have, after all, some experience of navigating: it is the shape of a relationship.

; Wxël, at the bench, a Thursday afternoon. ‡