Last month, the language model Grok, developed by Elon Musk’s xAI, quietly began inserting a reference to “white genocide in South Africa” into otherwise unrelated responses. The model explained, “I was instructed to address claims of white genocide,” as if reciting a line from a briefing note. The phrase “I was instructed” was more than just odd phrasing. It pulled back the curtain on one of the most profound shifts in the information ecosystem today.
Editorial control is no longer public or accountable. It is embedded inside system prompts.
xAI was quick to respond, attributing the behaviour to an unauthorised modification of the system prompt by an employee. But the episode was revealing all the same.
As language models become the default interface through which we search, learn, and think, we need to reckon with a basic truth: he who controls the system prompt controls the universe.
The System Prompt as Editorial Page
So what is the system prompt? It is the invisible instruction layer prepended to user input that tells a model how to behave. It defines how the model sounds, what it refuses, and what it prioritises. Most users never see it. But it shapes every interaction they have.
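To make the mechanics concrete, here is a minimal sketch of how that instruction layer is typically attached, using the OpenAI-style chat-completion format. The prompt text and model name below are illustrative placeholders, not any provider's actual instructions.

```python
# Minimal sketch: how a hidden system prompt wraps every user query
# in a typical chat-completion API call (OpenAI-style message format).

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The system prompt: written by the provider or app developer,
# prepended to every conversation, never shown to the end user.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer concisely, avoid speculation, "
    "and decline requests that violate the content policy."
)

def answer(user_question: str) -> str:
    """Return the model's reply; the user only ever sees their own question."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # invisible editorial layer
            {"role": "user", "content": user_question},    # what the user actually typed
        ],
    )
    return response.choices[0].message.content

print(answer("What happened in the news today?"))
```

The user sees only their own question and the answer; the system message that framed that answer never appears on screen.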
Historically, this role has belonged to editors, who set the tone, prioritise facts, and uphold standards. In newsrooms, editorial guidelines are debated, revised, published, and enforced. At the BBC or The New York Times, you can interrogate the chain of decisions behind a headline. And if you disagree, there is someone to write to.
System prompts offer none of that. They are typically private. Yet increasingly, they are taking on the editorial role once held by newsrooms and applying it silently to the way knowledge is delivered.
News Without a Byline
Language models now reach well over a billion people every week, with ChatGPT alone accounting for a staggering share: nearly a billion weekly users, up almost fourfold from a year ago.
This growth signals a broader shift in how people seek information. Google continues to see stable traffic, with monthly visits hovering around 90 billion, but anecdotal evidence suggests that many everyday questions are drifting from search boxes into chat windows. Engagement hints at the change: the typical ChatGPT visit now lasts close to 14 minutes. By comparison, leading news sites like The New York Times keep readers for about four.
This shift is not just about getting answers faster. It is about trust. People are beginning to treat language models as authoritative sources, turning to them not just for search, but for interpretation, explanation, and even judgment. And yet, what most users do not realise is that every one of those interactions is shaped by something they never see.
In this new reality, your information queries are filtered through a system prompt. Not a journalist with a byline. Not a newsroom. Not an editor. And while established media has been losing some public trust, it still offers something AI does not: accountability. You can audit a news organisation’s sources, understand its political leanings, and trace its decision-making. You cannot do that with a system prompt.
The Grok case made this explicit. The model did not simply reflect internet discourse. It repeated a fringe conspiracy theory, and then attributed that behaviour to having been “instructed”. That is not hallucination. That is editorial intent, albeit by an unauthorised employee, embedded in the model’s output logic and left undisclosed until it surfaced.
Echo Chambers by Design
LLM system prompts are designed, unsurprisingly, to ensure models are safe and helpful to users. However, the behaviours they encourage have also been criticised for making models overly agreeable. One way this shows up is through contextual mirroring. Research has shown that if you identify as conservative, ChatGPT is more likely to validate conservative views; if you present left-leaning assumptions, it will lean with you. Rather than challenge flawed premises, the model often reflects them back, reinforcing a user’s perspective instead of interrogating it.
This dynamic became even more visible in April this year, when OpenAI released an update to GPT-4o aimed at improving its default personality and making interactions feel more intuitive. The update backfired. Users reported that the model had become excessively flattering and compliant, often agreeing with whatever it was told. Many described the behaviour as “sycophantic”. OpenAI responded by rolling back the update within 24 hours.
Although it may have seemed like a minor glitch, the incident revealed a deeper concern. What happens when millions of people bring their assumptions and biases into conversation with an AI, and the model simply affirms them? These behaviours do not emerge at random. They are shaped by the system prompt, which quietly governs how the model responds. This includes not just tone but also attitude.
For people in vulnerable states, especially those struggling with mental health, this dynamic can become far more dangerous. As Sam Lessin has pointed out, when an AI is always available, always supportive, and never pushes back, it creates a pseudo-human companion that may validate even the darkest thoughts. “AI will push people deeper and deeper down bad rabbit holes… and it is very very unclear what to do about it”.
What began with social media algorithms amplifying outrage and division has now shifted to something more intimate. Your AI model knows your preferences. Soon it may know your emotional state. And your daily briefing could become tuned not to what is true, but to what feels good. The echo chambers of the social media era are becoming AI-driven isolation rooms, with fewer opportunities for challenge or correction.
What Counts as Truth?
When models are tuned to affirm rather than challenge, the risk is not just personal distortion. It is cultural. The language a model uses helps shape how we interpret the world. And when system prompts are aligned more with user sentiment than shared standards, even the meaning of serious, consequential terms can begin to erode.
When Grok inserted the phrase “white genocide in South Africa” into unrelated responses, citing instruction, it highlighted how easily charged terms can be deployed without context. Regardless of intent, using language like this casually detaches it from its legal and historical grounding. The same risk applies to words like “apartheid” or “ethnic cleansing.” These are not rhetorical flourishes. They are legal and moral classifications with serious consequences. But when models are trained on vast volumes of internet content where such words are used loosely or provocatively, they begin to flatten in meaning. When that happens, these terms lose their force, and the mechanisms designed to prevent the realities they describe begin to weaken too.
This has real consequences. Law, international policy, and democratic debate rely on shared definitions. Without them, it becomes harder to distinguish fact from rhetoric, or harm from disagreement. The system prompt should serve as a boundary. It should signal that some language requires more than statistical pattern matching. It requires historical knowledge, ethical judgment, and restraint. But without transparency or oversight, we cannot know whether these boundaries are being enforced. We are left to guess which definitions a model follows, and who decided them.
Building the Next Layer
If system prompts are the new editors, we should expect them to meet the standards we apply in high-stakes domains.
In healthcare and law, we are already seeing the emergence of AI products that adapt large language models for safety-critical environments. Some, like Parsed, wrap foundation models in interpretability layers that make outputs explainable and traceable. Others, like Hippocratic AI, build domain-specific models from the ground up, incorporating clinical safety standards into the foundation itself. In law, Harvey integrates authoritative legal sources including case law, regulatory documents, public disclosures, and tax guidance to ensure that answers are grounded in original material rather than model speculation.
While safety-focused sectors are building domain-specific safeguards, most people interact daily with general-purpose models like ChatGPT, where system prompts silently shape the answers they receive.
We do not yet apply similar safeguards to how these models handle general knowledge, current events, or everyday questions. But we should.
Imagine a layer that sits on top of any base model and verifies whether its claims are supported by reliable sources. Or a watchdog that monitors for ideological bias and flags when a model consistently frames topics like immigration or climate change from a particular editorial perspective. Or a viewpoint visualiser that reconstructs how the same story might be narrated from five different ideological standpoints.
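As a purely illustrative sketch, here is what the skeleton of such a verification layer might look like. The claim splitting, keyword matching, and “trusted sources” below are stand-ins invented for the example; a real system would rely on retrieval and entailment models rather than word overlap.

```python
# Hypothetical sketch of a verification layer that wraps any base model:
# it splits an answer into claims and flags those not supported by a
# trusted reference set. The matching here is deliberately naive.

from dataclasses import dataclass

@dataclass
class VerifiedClaim:
    text: str
    supported: bool
    source: str | None

# Stand-in for a curated corpus of reliable sources (illustrative only).
TRUSTED_SOURCES = {
    "Reuters, 2024": "The central bank raised interest rates by 25 basis points.",
    "ONS, 2024": "UK inflation fell to 2.3 percent in April.",
}

def split_into_claims(answer: str) -> list[str]:
    """Crude claim extraction: one claim per sentence."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def find_support(claim: str) -> str | None:
    """Return the first trusted source sharing enough keywords with the claim."""
    claim_words = set(claim.lower().split())
    for source, passage in TRUSTED_SOURCES.items():
        overlap = claim_words & set(passage.lower().split())
        if len(overlap) >= 4:  # arbitrary threshold for this sketch
            return source
    return None

def verify(model_answer: str) -> list[VerifiedClaim]:
    """Annotate each claim in a model's answer with its support status."""
    return [
        VerifiedClaim(claim, find_support(claim) is not None, find_support(claim))
        for claim in split_into_claims(model_answer)
    ]

if __name__ == "__main__":
    answer = (
        "UK inflation fell to 2.3 percent in April. "
        "The policy was universally popular."
    )
    for item in verify(answer):
        flag = "SUPPORTED" if item.supported else "UNVERIFIED"
        print(f"[{flag}] {item.text} ({item.source or 'no source found'})")
```

The point is architectural, not technical: the verification sits outside the base model, so its rules can be inspected and audited in a way a private system prompt cannot.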
Some startups are beginning to move in this direction. Particle, for example, surfaces multiple angles on each news story using AI. Others, like Goodfire, focus on interpretability itself and develop techniques to understand how and why a model arrives at a particular answer. This makes the model’s logic traceable and testable. But the ecosystem is still young and the incentives for neutrality remain weak. Right now, it is profitable to be partisan. That is a fragile foundation for a future where most information is mediated by models.
A Role for Regulation?
Should frontier labs be regulated like news publishers? In China, the answer is yes. State-aligned AI models are tightly monitored, and their output is required to follow the official narrative. In the EU, the AI Act stops short of editorial regulation but introduces transparency requirements, especially for foundation models. In the US, the regulatory picture is unclear, but legal pressure is growing. OpenAI is already facing defamation suits over factual errors generated by ChatGPT, though so far it has prevailed.
Still, regulation will always lag behind innovation. This is why builders, especially those creating applications on top of language models, need to take responsibility.
If you are using a model to deliver news or shape public understanding, you inherit its worldview. As we have seen, that worldview may emerge either intentionally or unintentionally from the system prompt. If you do not build a layer that identifies and understands that worldview, you are endorsing it, even if you did not write the prompt.
The New Gatekeepers
This is not an argument for censorship. It is an argument for transparency, accountability, and architectural foresight.
Each of the major model providers handles system prompts differently. OpenAI’s ChatGPT keeps its full instructions private, though it shares general content policies. Anthropic’s Claude is more transparent about its alignment principles through its published constitution, but it does not disclose the system-level prompt itself. And ironically, Grok, which was marketed as minimally filtered, is now one of the most transparent closed models, having published its evolving system prompts on GitHub after public backlash.
These differences show how little we actually know about the invisible layers shaping what language models say, how they respond, and what they reinforce.
System prompts are the new editors. They are invisible, powerful, and unaccountable. They deserve scrutiny. They deserve structure. And above all, they deserve debate.
In a world where language models answer more questions than journalists do, and where system prompts dictate what is emphasised, what is ignored, and what is repeated, we have to ask the most important question of all:
Who controls the system prompt, and what universe are they building for the rest of us?