# Agentic Kestra: making an LLM a first-class flow author
MerchSage runs on Kestra. The merch pipeline itself is around 30 flows. We liked the architecture enough that we extended it to the rest of the business — ops and marketing run as Kestra flows too — bringing the total close to 60, all backed by a single Postgres database that every stage reads from and writes to. Every line of code in the system was written by Claude — not a single human line. The humans direct product and architecture, with Claude's help on both.
This post is about what changes when you take that seriously. Specifically: what tools an LLM needs to be a real Kestra collaborator, and what code-level contracts you have to enforce so it can debug its own work.
## The problem
If you let an LLM write a Kestra flow with only a filesystem and a shell, here's what you'll observe.
It writes the flow. It tries to "test" by reading the YAML back. It cannot dispatch a run. If it could dispatch, it cannot follow logs. If it could follow logs, it cannot inspect the rows that the flow's Python tasks wrote. If it cannot inspect rows, it cannot debug the data — only the syntax. It will then over-correct, add layers of defensive code, and bury the actual bug.
The fix isn't a smarter model. The fix is closing the loop. The author needs to be able to operate what they author.
## The MCP

We ship `@merchsage/mcp-kestra`, an MCP server that exposes the operations needed to close the loop. The tools fall into three buckets.
### Authoring

| Tool | What it does |
|---|---|
| `kestra_sync` | Sync flows or namespace files to the Kestra server. Targets: `flow`, `namespace`, `all`. |
Authoring is the easy part. The hard part is what comes after.
### Operating

| Tool | What it does |
|---|---|
| `kestra_run` | Dispatch any flow with inputs. Optional `wait=True` polls every 10s. |
| `kestra_pipeline_test` | Cheap-test wrapper for the main pipeline. Selects stages, picks a small/fast channel, runs against the `onboarding_micro` config preset. |
| `kestra_status` | Status, stage progress, and artifact counts for an execution. |
| `kestra_logs` | Logs for an execution, filterable by level. |
| `kestra_list` | Recent executions for a flow. |

`kestra_pipeline_test` is the one we'd argue for hardest. Without a cheap-test wrapper, the model defaults to either "I think it works" or running a full pipeline that costs real money. With one, it dispatches a one-design micro-run, waits ~12 minutes, and reads the artifact counts. That's the development inner loop.
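For concreteness, here is roughly what one pass through that inner loop looks like from the agent's side, written as a list of MCP tool calls. The tool names are the real ones from the tables above; the argument names, stage names, and values are illustrative, not the actual tool schema.

```python
# A hypothetical transcript of one development inner loop, as (tool, args) pairs.
# Tool names are real; argument shapes and stage names are illustrative only.
inner_loop = [
    ("kestra_sync", {"target": "flow"}),             # push the edited flow
    ("kestra_pipeline_test", {                       # cheap one-design micro-run
        "stages": ["concepts", "designs"],           # illustrative stage subset
        "preset": "onboarding_micro",
    }),
    ("kestra_status", {"execution_id": "..."}),      # stage progress, artifact counts
    ("kestra_logs", {"execution_id": "...", "level": "ERROR"}),
]
```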
### Inspecting (the database)

| Tool | What it does |
|---|---|
| `db_query` | Read-only SQL. Auto-appends `LIMIT 50`. Returns a formatted table. |
| `db_schema` | List tables or describe one table's columns/types. |
| `db_channel` | Look up a channel by UUID, handle (`@Name`), or YouTube channel ID. |
| `db_artifacts` | Count all pipeline artifacts for a channel. |
| `db_execution` | Status, stage progress, and artifact counts for an execution. |
| `db_count` / `db_get` / `db_find` | Fast typed queries for common shapes. |
| `db_reset_channel` | Delete all pipeline artifacts for a channel (dry-run by default). |
This is the half of the loop that's usually missing. A flow's logs tell you "task X succeeded." The DB tells you whether task X did the right thing. With `db_query`, the agent can ask "did this run actually write a `design_variants` row with a non-null `s3_key`?" instead of guessing from a green checkmark.
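That question becomes a literal query. In the sketch below, `design_variants`, `s3_key`, and `channel_uuid` are the real names used in this post; the other column names are illustrative.

```python
# The agent's question as SQL for a db_query call. db_query is read-only and
# auto-appends LIMIT 50, so no explicit limit is needed here.
sql = """
SELECT id, s3_key, created_at          -- id/created_at are illustrative columns
FROM design_variants
WHERE channel_uuid = :channel_uuid
  AND s3_key IS NOT NULL
"""
```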
`db_reset_channel` deserves a comment: it defaults to `dry_run=true` and prints what it would delete. We added the dry-run default after the model deleted real channel data while trying to "clean up state." Defaults matter when an LLM is the caller.
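A minimal sketch of that default, assuming per-table count and delete helpers; `count_rows`, `delete_rows`, and the table list are hypothetical names, not the real implementation.

```python
def db_reset_channel(channel_uuid: str, dry_run: bool = True) -> None:
    """Delete all pipeline artifacts for a channel. Dry-run unless told otherwise."""
    tables = ["product_concepts", "design_variants", "mockups"]  # illustrative list
    for table in tables:
        n = count_rows(table, channel_uuid=channel_uuid)    # hypothetical helper
        if dry_run:
            print(f"[dry-run] would delete {n} rows from {table}")
        else:
            delete_rows(table, channel_uuid=channel_uuid)   # hypothetical helper
```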
## Codebase contracts
The MCP closes the loop. The codebase has to make the loop useful. Five contracts make agentic authoring tractable.
### 1. Thin inline Python in flows

Business logic belongs in Python modules. Flow YAML contains input declarations, task wiring, env injection, and thin wrapper scripts (under 15 lines) that import a module function and call `emit_outputs`.
```yaml
# GOOD
script: |
  import sys, os
  sys.path.insert(0, ".")
  from merchsage.listing.seo import enrich_design_seo
  from merchsage.kestra import emit_outputs

  result = enrich_design_seo(
      design_id=os.environ["DESIGN_ID"],
      channel_uuid=os.environ["CHANNEL_UUID"],
  )
  emit_outputs(result)
```
200 lines of business logic in a YAML string is unreadable, untestable, and un-fixable for both a human and an agent. We mechanically resist that pattern.
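For reference, `emit_outputs` doesn't need to be clever. A sketch of such a helper, assuming the `kestra` pip package, whose `Kestra.outputs()` is the standard way for a script task to return outputs:

```python
# merchsage/kestra.py (sketch): the real module may do more, e.g. validation.
from kestra import Kestra

def emit_outputs(result: dict) -> None:
    """Publish a task's result dict as Kestra execution outputs."""
    Kestra.outputs(result)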
### 2. `pluginDefaults` injects credentials

Every Python task gets credentials and `EXECUTION_ID` automatically through global `pluginDefaults`. Flow tasks don't declare `env:` for `GEMINI_TOKEN`, `DB_HOST`, `AWS_*`, etc. They only declare `env:` for dynamic, task-specific values like `CHANNEL_UUID`.

The agent doesn't have to remember which env vars to plumb. It writes `os.environ["GEMINI_TOKEN"]` and it works. That removes a whole class of "I forgot to map this" bugs.
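Sketched at flow level, that looks like the YAML below. The post sets these globally rather than per flow, and the secret names are ours; the `pluginDefaults` shape and the `secret()` / `execution.id` expressions are standard Kestra.

```yaml
pluginDefaults:
  - type: io.kestra.plugin.scripts.python.Script
    values:
      env:
        GEMINI_TOKEN: "{{ secret('GEMINI_TOKEN') }}"
        DB_HOST: "{{ secret('DB_HOST') }}"
        EXECUTION_ID: "{{ execution.id }}"
```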
### 3. Fail fast over fallbacks
This one is a behavioral contract more than an architectural one. If you let an LLM write code with no constraints, it will add try/except around every API call and fall back to defaults. Six months later, you'll have a pipeline that appears to work and silently produces wrong artifacts.
The codebase rule is: no synthetic data, no fallbacks, no backward-compatibility shims. Missing prompt? Crash. Missing required field? Crash. Mismatched fields? Skip with a warning. Required artifact missing? `sys.exit(1)`.
```python
# BAD
prompt = get_prompt("my_prompt") or "Some hardcoded fallback"

# GOOD
prompt = get_prompt("my_prompt")  # raises PromptLoadError
```
A failed pipeline run is cheap. A silently-degraded run is not. This rule is how we keep agent-authored code debuggable.
### 4. `os.environ["KEY"]`, not `.get(default)`
Same principle, narrower instance. Flow YAML always provides declared env vars at runtime, so a Python-side default is dead code that masks missing configuration.
```python
# BAD — hardcoded default masks missing config
region = os.environ.get("S3_REGION", "eu-west-1")

# GOOD — KeyError if missing, fixed in seconds
region = os.environ["S3_REGION"]
```
A `KeyError` with a clear name is a one-line fix. A wrong default that produces wrong artifacts is a week-long mystery.
### 5. DB-first stage handoff

Stages don't pass outputs to one another through Kestra. They write to the DB and exit. The next stage loads what it needs by `channel_uuid`.
The big win is iteration. Once a stage has produced its output, you can re-run any downstream stage against that output as many times as you like — with different params, different prompts, or different code — without paying to regenerate the upstream work. A set of concepts can drive a dozen design experiments. A set of designs can produce mockups across several product configurations. The expensive upstream work is amortized across many cheap downstream variations. This is more than Kestra's task replay, which restarts a failed task in place — it's iterating on stage logic against a fixed upstream set.
Debuggability comes along for the ride. An agent investigating "why did Stage 5 produce no mockups?" can read the Stage 4 outputs from `design_variants` directly, without replaying Stage 4 or parsing an outputs JSON blob in a Kestra log. The DB is the audit trail.
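A sketch of the handoff contract between two stages. The helper and generator names (`insert_rows`, `load_rows`, and the rest) are hypothetical; `design_variants` and `channel_uuid` are the real names from this post.

```python
# Stage 4 (sketch): write artifacts to the DB and exit. No Kestra outputs handoff.
def run_design_stage(channel_uuid: str) -> None:
    variants = generate_design_variants(channel_uuid)  # hypothetical generator
    insert_rows("design_variants", variants)           # hypothetical DB helper

# Stage 5 (sketch): load upstream work by channel_uuid. Re-runnable at will,
# with new params or code, against the same fixed set of design_variants rows.
def run_mockup_stage(channel_uuid: str) -> None:
    for variant in load_rows("design_variants", channel_uuid=channel_uuid):
        render_and_store_mockups(variant)              # hypothetical
```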
## What this unlocks
When the loop is closed and the contracts hold, the agent goes from writing code to operating the system. Concretely:
- It writes a new flow, syncs it, dispatches a micro-run, reads `kestra_logs`, finds a `KeyError` on a missing env var, fixes it, re-syncs, re-runs.
- It investigates "why did this channel only get 1 design instead of 8?" by `db_query`-ing `design_variants` joined to `product_concepts`, finding the rating threshold rejected 7 of them, and adjusting the config preset.
- It diagnoses a stuck `phase_creative` execution by reading `kestra_status`, killing it, calling `db_reset_channel` with `dry_run=true` first to confirm what it'll delete, then running it for real.
None of those steps require the agent to ask "what should I do?" The information it needs is directly accessible by tool call.
## The point
Two things make this work, and they're co-dependent. The MCP closes the operating loop — author, run, observe, inspect, reset. The codebase contracts make the signals from that loop trustworthy — fail loudly, never fall back, never silently degrade.
The same setup now runs the merch pipeline, the ops automation, and the marketing flows — with the humans focused on product and architecture, not the code.