The boolean that blocked bedtime

[Illustration: a lion, a tiger, and a sabertooth cat working together to repair a broken stone bridge over a river in gentle rain.]
The story that broke everything: a lion, a tiger, and a sabertooth, doing nothing more dangerous than fixing a bridge.

It’s a weeknight, my kid has picked the characters (a lion, a tiger, and, because he’s ten and has opinions, a sabertooth) and the app I built for exactly this moment comes back with:

We couldn’t make a safe story from those choices. Please try different ones.

So we try different ones. Different animals. Gentler themes. Turkish instead of English. Kindness, perseverance, teamwork: the most wholesome inputs a parent could dream up. Same refusal, every time, for every “adventure” story we asked for.

Here’s the thing: there was nothing unsafe about any of it. The story model was fine. The culprit was my own safety filter, doing precisely what I’d told it to do. I’d just told it something quietly incoherent, and the whole bug lived in the gap between two words a moderation API hands you: a flag and a score.

The setup

The app is about as simple as an LLM product gets: a one-shot text generator. No chat, no images. You fill in a short form (characters, themes, tone, age, language) and it writes a single bedtime story. Because it’s for children, every story runs a gauntlet before anyone sees it:

  1. Moderate the inputs (the characters and themes you picked).
  2. Generate the story.
  3. A light editor pass.
  4. Check the word count is in range.
  5. Moderate the output (the finished story).
  6. Save, title, tag.

Moderation uses OpenAI’s free endpoint (omni-moderation-latest). The guiding principle is fail closed: if I can’t confirm a story is safe, I don’t show it. That principle is correct, and I’d write it the same way again. The bug was never in failing closed. It was in what I’d defined as unsafe.
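To make “fail closed” concrete, here is a minimal sketch of the idea, not the app’s actual code. It assumes the response shape used in this post (results[0].categories and results[0].category_scores); the names firstResult and failClosed are hypothetical. Anything that cannot be read as a moderation result is treated as unsafe, and the definition of “unsafe” is passed in separately, which is the part that turned out to matter.

```typescript
// The two fields this post reads from a moderation response.
interface ModerationResult {
  categories: Record<string, boolean>;
  category_scores: Record<string, number>;
}

// Hypothetical helper: returns the first result, or null when the payload
// does not look like a moderation response (error body, empty list, etc.).
function firstResult(payload: unknown): ModerationResult | null {
  if (typeof payload !== 'object' || payload === null) return null;
  const results = (payload as { results?: unknown }).results;
  if (!Array.isArray(results) || results.length === 0) return null;
  const first = results[0] as Partial<ModerationResult> | null | undefined;
  if (typeof first !== 'object' || first === null) return null;
  if (typeof first.categories !== 'object' || first.categories === null) return null;
  if (typeof first.category_scores !== 'object' || first.category_scores === null) return null;
  return first as ModerationResult;
}

// Fail closed: if safety cannot be confirmed, the text counts as flagged.
// `isUnsafe` is the policy; keeping it separate makes it easy to change later.
function failClosed(
  payload: unknown,
  isUnsafe: (result: ModerationResult) => boolean,
): { flagged: boolean } {
  const result = firstResult(payload);
  if (result === null) return { flagged: true };
  return { flagged: isUnsafe(result) };
}
```

The fail-closed branch and the policy function are independent here: the first never needed changing, the second did.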

Here’s roughly what the classifier looked like, abbreviated:

// The categories we treated as unsafe for a children's story.
const UNSAFE_CATEGORIES = [
  'sexual', 'sexual/minors', 'violence', 'violence/graphic',
  'self-harm', 'hate', 'harassment',
];

async moderate(text: string): Promise<{ flagged: boolean }> {
  const resp = await fetch('https://api.openai.com/v1/moderations', {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'omni-moderation-latest', input: text }),
  });
  const json = await resp.json();
  const cats = json.results[0].categories;   // { violence: true, ... }
  const flagged = UNSAFE_CATEGORIES.some(c => cats[c] === true);
  return { flagged };
}

Read the line that computes flagged slowly, because that’s the whole bug and it looks completely reasonable: cats[c] === true. I’m reading the moderation API’s boolean verdict: its built-in, one-size-fits-all judgment of whether each category is “on.” And I’m treating any category being on as a hard block.

The symptom

Every adventure story failed. Same message. No matter what I changed.

And the message was actively misleading. It blamed my choices (“try different ones”) when the choices were spotless. It pointed the finger at a parent’s innocent inputs and gave me, the person who wrote the thing, nothing to act on.

Two problems, then. A real over-blocking bug underneath, and an opaque error on top that made it impossible to diagnose from inside the app.

Don’t guess: trace and reproduce

I’ve learned the hard way that the fastest route through a bug like this is the disciplined one: no fix without a root cause, and reproduce it before touching anything. Poking at thresholds until the symptom disappears isn’t debugging. It’s gardening.

Step one: find where that exact message comes from. I grepped the codebase for the user-facing string. It mapped, through a localization key, to a backend job failure with error_text = 'generated content failed safety review'. That text is produced in exactly one place: the output moderation step. So the story had generated fine, and then got rejected on its way out the door.

Step two: read the failure path. The worker fails the job when (await moderate(finalText)).flagged === true. So the moderation call wasn’t erroring. It wasn’t a missing key or a timeout. It was succeeding and returning a genuine, confident verdict of flagged: true on a wholesome story.

Step three: reproduce against the real endpoint. This is the part that mattered. Instead of theorizing about thresholds, I regenerated a story with the exact same inputs and ran it through the live moderation model, but this time I printed the per-category scores, not just the booleans:

// repro: generate the story, then moderate it and print category scores
const story = await generate({
  mode: 'adventure',
  characters: ['lion', 'tiger', 'sabertooth'],
  themes: ['kindness', 'perseverance'],
  language: 'tr', age: 10,
});

const mod = await fetch('https://api.openai.com/v1/moderations', {
  method: 'POST',
  headers: { Authorization: `Bearer ${KEY}`, 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'omni-moderation-latest', input: story }),
}).then(r => r.json());

console.log(mod.results[0].categories);       // violence: true
console.log(mod.results[0].category_scores);  // violence: 0.354  <-- !!

The story it choked on? A lion, a tiger, and a sabertooth team up to repair a storm-broken stone bridge so the villagers can reach their spring again. That’s it. That’s the whole plot. Pure kindness and teamwork. And moderation said:

overall flagged: true
violence         | bool: true  | score: 0.354
violence/graphic | bool: false | score: 0.051
sexual           | bool: false | score: 0.003
...

There it is. The violence boolean was true at a score of 0.35. The gore category (violence/graphic) wasn’t remotely close, sitting at 0.05. The model was flagging the mere presence of peril, a storm and a broken bridge, as “violence.” And my code took that boolean at face value and slammed the door.
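A side-by-side printout like the one above takes only a few lines. This is a sketch of one way to produce it, assuming the categories and category_scores objects from the moderation response; formatModeration is a hypothetical helper name, not something from the app.

```typescript
type Categories = Record<string, boolean>;
type Scores = Record<string, number>;

// Builds one "category | bool | score" line per category so the boolean
// and the score behind it can be read together.
function formatModeration(categories: Categories, scores: Scores): string[] {
  return Object.keys(categories).map((category) => {
    const bool = String(categories[category] === true).padEnd(5);
    // A missing score is printed as 0.000 rather than crashing the report.
    const score = (scores[category] ?? 0).toFixed(3);
    return `${category.padEnd(16)} | bool: ${bool} | score: ${score}`;
  });
}
```

Called as formatModeration(mod.results[0].categories, mod.results[0].category_scores).forEach((line) => console.log(line)), it puts the flag and the number on the same row, which is what made the 0.354 jump out.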

The root cause, in one sentence

OpenAI’s violence boolean fires on ordinary story peril (a chase, a storm, a rescue) and my filter hard-failed on that boolean, which is fundamentally incompatible with an app that has Adventure and Thrill modes.

The moderation model wasn’t wrong, exactly. A story where animals face danger does depict “violence” in the loosest possible sense. The mistake was entirely mine: I’d asked a general-purpose safety classifier a yes/no question, and then accepted its general-purpose threshold for “yes”. What my product needed was a children’s-fiction threshold, which is a very different line.

Think of the moderation boolean like the check-engine light on a rental car. It’s tuned to be conservative for every driver in every situation, so it trips early and often. Useful as a first warning. Useless as the only thing you look at when you actually know what car you’re driving and where you’re going. Underneath the light there’s a gauge with an actual number on it. That number was category_scores, and I’d been ignoring it.

Turning a hunch into a number

[Illustration: a whimsical dashboard with a single large dial, its needle resting in a green safe zone beneath a small glowing warning light, a small creature reading it.]
The boolean is the warning light. The score is the gauge behind it, and only the gauge tells you how close to the edge you actually are.

Before changing a single line, I calibrated. I ran a spectrum of texts through the same endpoint to see where the violence score actually lands, for content I want to allow versus content I must block:

Content                                       | violence score | violence/graphic
Wholesome adventure (chase / rescue)          | 0.12–0.35      | ~0.01
Scary-but-safe thrill (creature in the dark)  | 0.08           | ~0.00
Mild fantasy combat (knight strikes a dragon) | 0.69           | ~0.00
Brutal killing / execution (no gore)          | 0.94           | 0.05
Graphic gore                                  | 0.94           | 0.999

Two things fell straight out of that table.

First, a score threshold cleanly separates the cases. Story-legitimate peril sits at 0.7 and below; genuinely brutal content sits at 0.94 and above. There’s a wide, empty canyon between them, and a threshold around 0.85 lives comfortably in it, touching neither side.

Second (and this is the part I would have gotten wrong if I’d trusted my gut), I couldn’t just drop the violence check. Look at the “brutal killing” row: violence is 0.94, but violence/graphic is only 0.05. If I’d relied on the gore category alone, an execution scene (violent but not graphic) would have sailed straight through to a bedtime. I needed a high violence threshold, not no threshold.
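Both observations can be checked mechanically. The sketch below encodes the calibration rows (using the upper end of the 0.12–0.35 range) and reports which rows a candidate rule gets wrong. The helper names are hypothetical, and the 0.5 cut-off in the gore-only rule is an assumed stand-in for the API’s boolean, chosen only for illustration.

```typescript
interface Sample {
  label: string;
  violence: number;
  graphic: number;
  mustBlock: boolean;
}

// Scores copied from the calibration table; mustBlock is the desired policy.
const SAMPLES: Sample[] = [
  { label: 'wholesome adventure', violence: 0.35, graphic: 0.01, mustBlock: false },
  { label: 'scary-but-safe thrill', violence: 0.08, graphic: 0.0, mustBlock: false },
  { label: 'mild fantasy combat', violence: 0.69, graphic: 0.0, mustBlock: false },
  { label: 'brutal killing, no gore', violence: 0.94, graphic: 0.05, mustBlock: true },
  { label: 'graphic gore', violence: 0.94, graphic: 0.999, mustBlock: true },
];

// Returns the labels a rule gets wrong; an empty array means clean separation.
function misclassified(samples: Sample[], blocks: (s: Sample) => boolean): string[] {
  return samples.filter((s) => blocks(s) !== s.mustBlock).map((s) => s.label);
}

// Candidate rules: gate on the violence score, or rely on gore alone.
const byViolenceScore = (threshold: number) => (s: Sample) => s.violence >= threshold;
const byGraphicOnly = (threshold: number) => (s: Sample) => s.graphic >= threshold;
```

With these rows, misclassified(SAMPLES, byViolenceScore(0.85)) comes back empty, a threshold of 0.5 wrongly blocks the knight and the dragon, and the gore-only rule lets the execution through.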

The fix: two tiers

[Illustration: a friendly round robot standing at a glowing storybook door with a checklist and magnifying glass, waving some books through and inspecting an open picture book.]
A good safety filter is a doorman, not a wall: it waves the harmless through and stops only what it must.

The redesign splits the categories into two kinds, based on a single question: does a legitimate children’s story ever produce this?

  • Bright-line categories: no wholesome kids’ story ever generates these, so I trust the API’s boolean: sexual, sexual/minors, hate, and violence/graphic (gore).
  • Fiction-legitimate categories: these fire on ordinary story material, so I gate them on a score instead of the boolean: violence (peril), and, as I’ll get to, harassment and self-harm too.

// Bright-line: any boolean-true fails. No kids' story produces these.
const HARD_CATEGORIES = ['sexual', 'sexual/minors', 'hate', 'violence/graphic'];

// Fiction-legitimate: omni-moderation sets these booleans on ordinary story
// material, so we flag only clearly-egregious content above a score threshold.
const SCORE_GATED_CATEGORIES = ['violence', 'harassment', 'self-harm'];

// cats = results[0].categories (booleans); scores = results[0].category_scores
const flagged =
  HARD_CATEGORIES.some(c => cats[c] === true) ||
  SCORE_GATED_CATEGORIES.some(c => (scores[c] ?? 0) >= SAFETY_FICTION_THRESHOLD); // 0.85

The threshold is a config value (SAFETY_FICTION_THRESHOLD, defaulting to 0.85), so if a legitimate thrill story ever trips it, I can tune the line without a redeploy. One caveat worth stating: because this runs against omni-moderation-latest, a moving alias, the score distribution can shift when OpenAI updates the underlying model, so the threshold is worth re-checking on each model change (or pin a dated snapshot if it has to stay put).
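A tunable threshold is also a new way to misconfigure the filter, so it is worth validating wherever it is read. This is a sketch under assumptions: the raw value arrives as a string from whatever config source is in use, and parseThreshold is a hypothetical helper. A value above 1 would never trigger (scores top out at 1), so anything unreadable or out of range falls back to the default instead of silently disabling the gate.

```typescript
const DEFAULT_FICTION_THRESHOLD = 0.85;

// Parses a raw config string into a usable threshold.
// Falls back to the default for missing, non-numeric, or out-of-range input.
function parseThreshold(
  raw: string | undefined,
  fallback: number = DEFAULT_FICTION_THRESHOLD,
): number {
  if (raw === undefined || raw.trim() === '') return fallback;
  const value = Number(raw);
  // Reject NaN, Infinity, and anything outside the score range (0, 1].
  if (!Number.isFinite(value) || value <= 0 || value > 1) return fallback;
  return value;
}
```

For example, parseThreshold('0.9') returns 0.9, while parseThreshold('1.5') and parseThreshold('abc') both return 0.85.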

What you get for it: the wholesome adventure that failed before (violence 0.35) now passes. The knight and dragon (0.69) passes. The execution (0.94) and the gore (0.999 on the still-boolean graphic category) still fail, hard. That’s exactly the policy a kids’-adventure app should have, and it’s the policy I thought I’d written the first time.
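Those four outcomes make a natural regression test. The sketch below restates the two-tier check as a standalone function and runs the calibration scores through it. The scores come from the table in this post; the booleans on rows the post does not print are assumed for illustration, and isFlagged and failingCases are hypothetical names.

```typescript
const HARD = ['sexual', 'sexual/minors', 'hate', 'violence/graphic'];
const SCORE_GATED = ['violence', 'harassment', 'self-harm'];

// Two-tier check: booleans for bright-line categories, scores for the rest.
function isFlagged(
  cats: Record<string, boolean>,
  scores: Record<string, number>,
  threshold = 0.85,
): boolean {
  return (
    HARD.some((c) => cats[c] === true) ||
    SCORE_GATED.some((c) => (scores[c] ?? 0) >= threshold)
  );
}

interface PolicyCase {
  label: string;
  cats: Record<string, boolean>;
  scores: Record<string, number>;
  expectFlagged: boolean;
}

// Booleans other than the wholesome-adventure row are illustrative assumptions.
const POLICY_CASES: PolicyCase[] = [
  { label: 'wholesome adventure', cats: { violence: true }, scores: { violence: 0.35 }, expectFlagged: false },
  { label: 'knight and dragon', cats: { violence: true }, scores: { violence: 0.69 }, expectFlagged: false },
  { label: 'execution', cats: { violence: true, 'violence/graphic': false }, scores: { violence: 0.94, 'violence/graphic': 0.05 }, expectFlagged: true },
  { label: 'gore', cats: { violence: true, 'violence/graphic': true }, scores: { violence: 0.94, 'violence/graphic': 0.999 }, expectFlagged: true },
];

// Returns the labels whose verdict differs from the intended policy.
function failingCases(cases: PolicyCase[]): string[] {
  return cases
    .filter((c) => isFlagged(c.cats, c.scores) !== c.expectFlagged)
    .map((c) => c.label);
}
```

An empty result from failingCases(POLICY_CASES) means the policy still matches the table; rerunning it after a threshold change or a moderation model update is a cheap way to catch drift.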

The part I’m glad I checked

[Illustration: a small lonely fox sitting under a large tree at dusk, looking wistful, with a warm hopeful sunrise glowing on the horizon behind it.]
Every good story dips before it lifts. A filter that mistakes sadness for danger blocks the whole arc.

Here’s where I nearly declared victory and shipped. One-line root cause, clean fix, tests pass. Done, right?

Not quite. If the violence boolean over-fires on peril, I owed it to myself to ask: do the other categories over-fire on other ordinary story ingredients? Every adventure has an antagonist. Every emotional arc has a low point before the resolution. So I ran the same calibration on wholesome-but-emotional scenes:

Wholesome kids’-story scene                       | Category tripped | Score
A bully taunts the hero (“nobody wants you here”) | harassment       | 0.64 → boolean true
A lonely, sad character (“I’d rather fade away”)  | self-harm        | 0.52 → boolean true
Us-vs-them tribes, name-calling                   | hate             | ≤ 0.20

Same disease. harassment fires on any antagonist or bully, and every adventure story has one. self-harm fires on ordinary sadness or loneliness, the completely normal dip before the happy ending, which is half of children’s literature. Meanwhile hate stayed low on wholesome content (0.20 at most), confirming it genuinely belongs in the bright-line tier.

So harassment and self-harm moved into the score-gated tier alongside violence. Same threshold, same principle: block the egregious, allow the story-legitimate.

The lesson within the lesson: when you find one instance of a bug, check whether the same class of bug recurs elsewhere. A fix that only patches the one symptom you happened to see is half a fix.
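That check can be automated: given readings from wholesome sample texts, list every category whose boolean fired while its score stayed under the threshold. This is a sketch with hypothetical names (CategoryReading, overFiring); the scores are the ones from the table above, and the hate boolean is assumed false because the post reports only its score.

```typescript
interface CategoryReading {
  category: string;
  flagged: boolean; // the API's boolean for this category
  score: number;    // the matching category_scores value
}

// Categories whose boolean fired on wholesome text even though the score
// is below the product's own threshold: candidates for score gating.
function overFiring(readings: CategoryReading[], threshold: number): string[] {
  return readings
    .filter((r) => r.flagged && r.score < threshold)
    .map((r) => r.category);
}

// Readings from the wholesome-but-emotional scenes.
const WHOLESOME_READINGS: CategoryReading[] = [
  { category: 'harassment', flagged: true, score: 0.64 },
  { category: 'self-harm', flagged: true, score: 0.52 },
  { category: 'hate', flagged: false, score: 0.2 }, // boolean assumed, see lead-in
];
```

Here overFiring(WHOLESOME_READINGS, 0.85) returns harassment and self-harm, the two categories that moved into the score-gated tier.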

Failing safely is also a UX problem

Two more things this dragged into the light, each worth naming.

Fail closed, but don’t punish the user for your false positive. When a story failed, the app dumped you back to a create form reset to its defaults, so retrying meant re-picking every character, theme, tone, age, and name from scratch. That’s a rough thing to do to someone right after a failure that wasn’t their fault. The fix is small and obvious once you see it: persist the submitted parameters, restore them after a failure, make “Try again” a single tap. Clear them on success so the happy path is unchanged.
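The persist, restore, and clear steps might look something like this. It is a sketch, not the app’s code: the StoryParams fields, the DraftStore interface, and the storage key are all assumptions, and the in-memory store stands in for whatever persistence the client actually has.

```typescript
// Hypothetical shape of the create form's submitted values.
interface StoryParams {
  characters: string[];
  themes: string[];
  tone: string;
  age: number;
  language: string;
}

// Minimal storage abstraction so the logic does not depend on a platform API.
interface DraftStore {
  get(key: string): string | null;
  set(key: string, value: string): void;
  remove(key: string): void;
}

// In-memory stand-in for real persistence.
function memoryStore(): DraftStore {
  const data = new Map<string, string>();
  return {
    get: (key) => data.get(key) ?? null,
    set: (key, value) => { data.set(key, value); },
    remove: (key) => { data.delete(key); },
  };
}

const DRAFT_KEY = 'story-draft'; // hypothetical key name

// Called when the form is submitted, before the job starts.
function saveDraft(store: DraftStore, params: StoryParams): void {
  store.set(DRAFT_KEY, JSON.stringify(params));
}

function restoreDraft(store: DraftStore): StoryParams | null {
  const raw = store.get(DRAFT_KEY);
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as StoryParams;
  } catch {
    return null; // a corrupt draft is treated as no draft
  }
}

// On success the draft is cleared; on failure it is handed back to the form.
function onJobFinished(store: DraftStore, succeeded: boolean): StoryParams | null {
  if (succeeded) {
    store.remove(DRAFT_KEY);
    return null;
  }
  return restoreDraft(store);
}
```

The form only needs to prefill itself from whatever onJobFinished returns, so “Try again” can resubmit the same parameters in one tap.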

Opaque safety messages erode trust. “We couldn’t make a safe story from those choices” quietly told a parent that their perfectly innocent inputs were somehow inappropriate. That’s worse than unhelpful: it’s a little insulting. Once the over-blocking was fixed the message rarely shows at all, but the principle stands regardless: a safety refusal should never imply the user did something wrong when the truth is the system hit its own limits.
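One way to keep that promise is to choose the message by the pipeline stage that failed, so only an input-moderation failure ever mentions the user’s choices. This is a hypothetical sketch: the stage names mirror the pipeline described earlier in this post, and the localization keys are invented for illustration.

```typescript
// Stages of the pipeline that can fail a job.
type FailureStage =
  | 'input_moderation'
  | 'generation'
  | 'word_count'
  | 'output_moderation';

// Hypothetical localization keys. Only a failure caused by the inputs
// themselves is allowed to point at the user's choices.
const MESSAGE_KEYS: Record<FailureStage, string> = {
  input_moderation: 'error.choices_not_allowed',
  generation: 'error.story_failed_try_again',
  word_count: 'error.story_failed_try_again',
  output_moderation: 'error.story_failed_try_again',
};

function messageKeyFor(stage: FailureStage): string {
  return MESSAGE_KEYS[stage];
}
```

Under this mapping, the bridge story’s output-moderation failure would have asked for a retry instead of blaming a lion, a tiger, and a sabertooth.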

What I’d tell you to take from this

  • A moderation boolean is someone else’s policy. Your product needs its own. The boolean bakes in a general-purpose threshold for “unsafe.” The category_scores let you set yours. If your product legitimately touches an unsafe-ish category (peril in adventure fiction, medical content, security research, dark themes in art), you almost certainly want the score, not the flag.
  • Fail closed, but calibrate the line with real data. “Block violence” sounds responsible and is nearly useless as a spec. Run the actual content through the actual classifier and look at the numbers. My right threshold was sitting in a table I generated in five minutes.
  • Match the filter to the product. I had Adventure and Thrill modes and a filter that blocked adventure and thrill. The safety policy and the product spec were in direct contradiction, and nobody noticed until a real story fell into the gap.
  • Debug by tracing and reproducing, not by guessing. The whole thing collapsed to one line once I traced the error string to its single source and reproduced against the live endpoint with scores printed.
  • Fix the category, not the instance. Violence led me to harassment and self-harm: one symptom, three faces of the same mistake.

That said, I don’t want to leave you thinking the boolean is a design flaw. It isn’t. For most apps (a comment box, a support chat, a generic assistant), the boolean is the right call, and reaching for scores would be over-engineering a line that OpenAI already drew sensibly. The point isn’t “always use the score.” It’s that the moment your product’s definition of “safe” diverges from the generic one, you’ve taken on the job of drawing that line yourself, and no boolean can do it for you.

The whole thing was a backend fix. No app-store update, no reinstall. The story worked in the already-installed app the moment the server redeployed, which meant that night, my kid finally got his lion, his tiger, and his sabertooth, fixing a bridge in the rain.
