Ciki Zeng
2026-04-30 · 8 min read · JumpOnion · Infrastructure · AI Engineering

Day 8: The LLM Succeeded. The Platform Killed the Connection.

New endpoint. Generates a personalized training plan from a diagnosis via a chain of LLM calls. Local development: works in 50-something seconds. Test suite: green. Deployed to production. Time for the first real-world smoke test.

I clicked "Generate Plan." The browser hung. A minute passed. The response came back: 504 Gateway Timeout. The natural conclusion was that the LLM had failed somehow. I hit refresh on the dashboard.

The new plan was already there.

Two Sources of Truth Disagreed

From the frontend's view, the request failed. The user saw 504. From the backend's view, everything was fine — the LLM finished, the plan was saved, the database row was written, no errors logged.

Two sources of truth disagreeing is the most diagnostic moment in software. Either one is lying or one is incomplete. My instinct was to start prodding the backend. The audit log answered first.

What the Audit Log Said

The same audit log that saved Day 7's incident answered this one too. Every LLM call had a row: model name, prompt revision, start timestamp, end timestamp, status, output.

For this request:

call_id: 1f3...
model: <llm-model>
prompt_rev: <prompt-version>
duration_ms: 59287       <-- 59.287 seconds
status: success
output_chars: 4831

The LLM had finished in 59.3 seconds. The deploy platform's gateway timeout was 60 seconds. The math was now clear: the LLM finished with roughly 0.7 seconds to spare, the backend started writing the response, the gateway killed the client connection because it had been holding for ~60 seconds, and the backend completed the database write while its response went into the void. The 504 was the gateway giving up, not the LLM giving up.
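The arithmetic can be written down as a small sketch. The `AuditRow` shape and the `gatewayHeadroomMs` helper are illustrative names, not the project's actual code; the 60-second limit and the 59,287 ms duration are the figures quoted above.

```typescript
// Hypothetical shape of one audit row; fields mirror the log excerpt above.
interface AuditRow {
  callId: string;
  durationMs: number;
  callStatus: "success" | "error";
}

// Gateway limit of the deploy platform, as stated in the post (60 seconds).
const GATEWAY_TIMEOUT_MS = 60000;

// Milliseconds left between the LLM call finishing and the gateway cutting the connection.
function gatewayHeadroomMs(row: AuditRow, gatewayTimeoutMs: number = GATEWAY_TIMEOUT_MS): number {
  return gatewayTimeoutMs - row.durationMs;
}

const loggedCall: AuditRow = { callId: "1f3...", durationMs: 59287, callStatus: "success" };
console.log(gatewayHeadroomMs(loggedCall)); // 713
```

Those 713 ms had to cover everything that was not the LLM call itself: routing, the database write, and serializing the response. That is not enough margin.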

Most managed deploy platforms enforce a hard gateway timeout, and it is often shorter than what your application's own timeout config allows. Your app might be configured for 120 seconds. The gateway will quietly cap you at 60.

The Naive Fixes Wouldn't Work

The default debugging path for a 504 on an LLM call would be:

  1. Bump application timeout to 120s.
  2. Add retry logic on the client.
  3. Email the LLM provider blaming their latency.

None of these would have helped, because none of them touch the actual constraint. The gateway timeout is in the platform's reverse proxy, not the application's config. It's a hidden ceiling. Bumping the app timeout to 120s just means the gateway kills the client at second 60 instead of the app killing it at second 120. Retry logic at the client just guarantees the same timeout happens twice.
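One way to see why the app-side bump is a no-op: the request passes through several layers, and the lowest limit wins. A minimal sketch, with illustrative layer names and the 120s / 60s values from this post:

```typescript
// One timeout per layer between the client and the handler.
// Layer names are illustrative; the values are the ones discussed in the post.
interface TimeoutLayer {
  layer: string;
  timeoutMs: number;
}

// The effective ceiling is whichever layer gives up first.
function effectiveCeiling(layers: TimeoutLayer[]): TimeoutLayer {
  if (layers.length === 0) throw new Error("no timeout layers given");
  return layers.reduce((lowest, next) => (next.timeoutMs < lowest.timeoutMs ? next : lowest));
}

const ceiling = effectiveCeiling([
  { layer: "application", timeoutMs: 120000 },
  { layer: "platform gateway", timeoutMs: 60000 },
]);
console.log(`${ceiling.layer}: ${ceiling.timeoutMs} ms`); // platform gateway: 60000 ms
```

Raising the application value changes nothing in the output until it drops below the gateway's.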

The Fix Was Two-Pronged

Two parallel small fixes, neither glamorous, both shipped same day:

  1. Slim the prompt. The training-plan prompt had been sending the full diagnosis JSON, including display-only fields the LLM was reasoning about but didn't need. Stripped down to the decision-relevant fields: input went from ~10K tokens to ~4K. Runtime dropped to 50-55 seconds. Bonus 20% cost reduction on every call. This bought back the margin against the 60-second ceiling.
  2. Frontend abort + polling fallback. The client now sets an AbortController at 58 seconds. If the POST hasn't returned, it cancels and starts polling a lightweight read endpoint that just checks: "is the cached plan ready yet?" The backend always writes the result, regardless of whether the client is still listening. From the user's view, the plan appears.
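The first fix can be sketched as an allow-list filter applied before the diagnosis is serialized into the prompt. The field names here (`weaknesses`, `skillLevel`, `goals`, `chartColors`, `badgeIconUrl`) are made up for illustration; the post does not list the real ones.

```typescript
// Hypothetical diagnosis shape; the real field names are not given in the post.
type Diagnosis = Record<string, unknown>;

// Assumed allow-list of decision-relevant fields (illustrative names only).
const DECISION_FIELDS: readonly string[] = ["weaknesses", "skillLevel", "goals"];

// Copy only the allow-listed fields, so display-only data never reaches the prompt.
function slimDiagnosis(diagnosis: Diagnosis, keep: readonly string[] = DECISION_FIELDS): Diagnosis {
  const slim: Diagnosis = {};
  for (const key of keep) {
    if (key in diagnosis) slim[key] = diagnosis[key];
  }
  return slim;
}

const fullDiagnosis: Diagnosis = {
  weaknesses: ["timing"],
  skillLevel: "intermediate",
  goals: ["endurance"],
  chartColors: ["#f00", "#0f0"], // display-only
  badgeIconUrl: "https://example.invalid/badge.png", // display-only
};

const promptInput = JSON.stringify(slimDiagnosis(fullDiagnosis));
console.log(promptInput);
```

An allow-list is used here rather than a deny-list so that display fields added later stay out of the prompt by default.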

The architectural lesson: treat long-running synchronous HTTP calls as best-effort, with a polling fallback as the authoritative path. The 504 stopped being a failure mode and became an expected branch — fall through to polling, wait a few seconds, return the result.
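A minimal sketch of that control flow, assuming a client object with two injected functions. `PlanClient`, `generate`, `readCached`, and the polling parameters are hypothetical names for illustration; only the 58-second abort comes from the post.

```typescript
// Hypothetical dependencies, injected so the flow can run without a real backend.
interface PlanClient<Plan> {
  // Starts generation; expected to reject when the signal aborts or the gateway answers 504.
  generate(signal: AbortSignal): Promise<Plan>;
  // Lightweight read: resolves to the cached plan, or null if it is not ready yet.
  readCached(): Promise<Plan | null>;
}

interface FallbackOptions {
  abortAfterMs: number;   // 58000 in the post, just under the 60 s gateway limit
  pollIntervalMs: number; // assumed value
  maxPolls: number;       // assumed value
}

const sleep = (ms: number): Promise<void> => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function generateWithFallback<Plan>(
  client: PlanClient<Plan>,
  opts: FallbackOptions,
): Promise<Plan> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), opts.abortAfterMs);
  try {
    // Best-effort path: the synchronous POST returns before the abort fires.
    return await client.generate(controller.signal);
  } catch {
    // Expected branch: abort or 504. The backend keeps working, so poll for its result.
    for (let attempt = 0; attempt < opts.maxPolls; attempt++) {
      const plan = await client.readCached();
      if (plan !== null) return plan;
      await sleep(opts.pollIntervalMs);
    }
    throw new Error("plan not ready after polling");
  } finally {
    clearTimeout(timer);
  }
}
```

This sketch falls through to polling on any error. A production client would likely distinguish an abort or 504 from, say, a validation error, and only poll in the first case.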

The Audit Table Saved Me Twice

Two distinct incidents — Day 7's data deletion and Day 8's phantom timeout — were both solvable because of one architectural decision: persist every LLM call's full input/output to a dedicated table.

Day 7: audit log made data recovery possible.
Day 8: audit log made the root cause provable in three queries.

In both cases, the cost of not having the log would have been catastrophic. The cost of having it: a few cents a day in storage and a table that no one queries until they desperately need to. A new SOP entry came out of these two incidents: for any product that ships an LLM call, persist every call's input and output to an audit table from day one. Day one. Not after the first incident. The first incident is exactly when you need the log you didn't write yet.
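One way to make that hard to forget is to wrap the LLM call so that it cannot run without writing a row. The sketch below is an assumption about how such a wrapper could look, not the project's implementation; `LlmAuditRow`, `AuditSink`, and `withAudit` are hypothetical names, and the sink stands in for an insert into the audit table.

```typescript
// Hypothetical audit row; columns follow the fields the post lists for each call.
interface LlmAuditRow {
  model: string;
  promptRev: string;
  startedAt: number; // epoch milliseconds
  endedAt: number;   // epoch milliseconds
  outcome: "success" | "error";
  input: string;
  output: string | null;
}

// Assumed storage interface; a real one would insert into the audit table.
type AuditSink = (row: LlmAuditRow) => Promise<void>;
type LlmCall = (input: string) => Promise<string>;

// Wrap an LLM call so a row is written whether the call succeeds or throws.
function withAudit(model: string, promptRev: string, call: LlmCall, sink: AuditSink): LlmCall {
  return async (input) => {
    const startedAt = Date.now();
    let output: string | null = null;
    let outcome: "success" | "error" = "error";
    try {
      output = await call(input);
      outcome = "success";
      return output;
    } finally {
      // Runs on both paths; a sink failure here would surface to the caller.
      await sink({ model, promptRev, startedAt, endedAt: Date.now(), outcome, input, output });
    }
  };
}
```

The row is written independently of whether any HTTP client is still connected, which is what made the Day 8 diagnosis possible.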

A New Checklist Item

Going forward, every new HTTP API that involves an LLM gets five questions before it ships:

  1. What's the platform's gateway timeout? (Not the app's — the platform's.)
  2. What's the realistic p95 latency of this LLM call?
  3. If (2) approaches (1), what's the fallback path?
  4. Is the LLM call's input + output written to the audit table?
  5. Does the frontend handle the "backend succeeded but gateway timed out" case gracefully?

If any answer is "I don't know," don't ship. Find out first. Five minutes of question-answering beats a week of phantom-timeout debugging.
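The five questions can also be encoded as a pre-ship check. This is a sketch under assumptions: the field names, the nearest-rank p95 method, and the 80% "approaches" threshold are illustrative choices, not part of the original checklist.

```typescript
// Nearest-rank p95 over observed latencies in milliseconds.
function p95(latenciesMs: number[]): number {
  if (latenciesMs.length === 0) throw new Error("need at least one sample");
  const sorted = latenciesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100); // integer math avoids float drift
  return sorted[rank - 1];
}

// null stands for "I don't know". Field names are illustrative.
interface ShipChecklist {
  gatewayTimeoutMs: number | null;
  p95LatencyMs: number | null;
  hasFallbackPath: boolean | null;
  auditsInputAndOutput: boolean | null;
  handlesLateSuccess: boolean | null;
}

// Assumed threshold: a p95 at or above 80% of the gateway limit counts as "approaching" it.
const APPROACH_RATIO = 0.8;

// Returns the reasons not to ship; an empty list means all five questions are answered.
function shipBlockers(c: ShipChecklist): string[] {
  const found: string[] = [];
  const keys: (keyof ShipChecklist)[] = [
    "gatewayTimeoutMs", "p95LatencyMs", "hasFallbackPath", "auditsInputAndOutput", "handlesLateSuccess",
  ];
  for (const key of keys) {
    if (c[key] === null) found.push(`unknown: ${key}`);
  }
  if (
    c.gatewayTimeoutMs !== null && c.p95LatencyMs !== null &&
    c.p95LatencyMs >= APPROACH_RATIO * c.gatewayTimeoutMs &&
    c.hasFallbackPath === false
  ) {
    found.push("p95 approaches the gateway timeout with no fallback path");
  }
  if (c.auditsInputAndOutput === false) found.push("LLM input and output are not audited");
  if (c.handlesLateSuccess === false) found.push("frontend does not handle a late backend success");
  return found;
}
```

Any `null` in the checklist shows up as an `unknown:` entry, which maps directly onto the "I don't know, so don't ship" rule.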

Without SOP, With SOP

Without SOP

Bump app timeout to 120s. Add a retry. Open a support ticket with the LLM provider. Customers paying for plan generation see 504, never refresh, assume the product is broken. Churn quietly.

With SOP

Audit log diagnoses the problem in three queries. The platform gateway is the constraint, not the LLM. Prompt slim + frontend polling fallback ships same day. The user sees the plan appear, even when the gateway killed the connection. Cost-per-call drops 20% as a bonus.

The Real Lesson

Modern deploy platforms have constraints that are invisible until they bite you. Gateway timeouts. Body-size limits. Response streaming caps. Cold-start latencies on serverless functions. None of these show up in your app config. They live somewhere in the platform's docs that you read once during setup, three years ago, and never thought about again.

The audit table is the universal solvent for this class of problem. When two sources of truth disagree, the audit table is the third source: independent of both the frontend and the application logic, and recording what actually happened at the call level. Two incidents already paid back its cost. There will be more.

Build for the gateway you have, not the gateway you wish for. And log every model call. The audit table will save you twice.

Next: Day 9 — three months earlier, a collaborator pasted an .env file into an AI chat to debug a deploy. Three months later, a grep over the project's knowledge vault found the keys still sitting there. Worse: the vault syncs to the cloud, and the cloud indexes into AI search.
