Agentic Kestra: making an LLM a first-class flow author

MerchSage runs on Kestra. The merch pipeline itself is around 30 flows. We liked the architecture enough that we extended it to the rest of the business — ops and marketing run as Kestra flows too — bringing the total close to 60, all backed by a single Postgres database that every stage reads from and writes to. Every line of code in the system was written by Claude — not a single human line. The humans direct product and architecture, with Claude's help on both.

This post is about what changes when you take that seriously. Specifically: what tools an LLM needs to be a real Kestra collaborator, and what code-level contracts you have to enforce so it can debug its own work.

The problem

If you let an LLM write a Kestra flow with only a filesystem and a shell, here's what you'll observe.

It writes the flow. It tries to "test" by reading the YAML back. It cannot dispatch a run. If it could dispatch, it cannot follow logs. If it could follow logs, it cannot inspect the rows that the flow's Python tasks wrote. If it cannot inspect rows, it cannot debug the data — only the syntax. It will then over-correct, add layers of defensive code, and bury the actual bug.

The fix isn't a smarter model. The fix is closing the loop. The author needs to be able to operate what they author.

The MCP

We ship @merchsage/mcp-kestra, an MCP server that exposes the operations needed to close the loop. The tools fall into four buckets.

Authoring

| Tool | What it does |
| --- | --- |
| `kestra_sync` | Sync flows or namespace files to the Kestra server. Targets: `flow`, `namespace`, `all`. |

Authoring is the easy part. The hard part is what comes after.

Operating

| Tool | What it does |
| --- | --- |
| `kestra_run` | Dispatch any flow with inputs. Optional `wait=True` polls every 10s. |
| `kestra_pipeline_test` | Cheap-test wrapper for the main pipeline. Selects stages, picks a small/fast channel, runs against the `onboarding_micro` config preset. |
| `kestra_status` | Status, stage progress, and artifact counts for an execution. |
| `kestra_logs` | Logs for an execution, filterable by level. |
| `kestra_list` | Recent executions for a flow. |

kestra_pipeline_test is the one we'd argue for hardest. Without a cheap-test wrapper, the model defaults to either "I think it works" or running a full pipeline that costs real money. With one, it dispatches a one-design micro-run, waits ~12 minutes, and reads the artifact counts. That's the development inner loop.
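Under the hood, the `wait=True` polling loop needs nothing exotic. A minimal sketch (names, terminal-state set, and timeout are illustrative, not the actual mcp-kestra implementation):

```python
import time

# States we treat as terminal for a Kestra execution (illustrative subset).
TERMINAL_STATES = {"SUCCESS", "WARNING", "FAILED", "KILLED"}

def wait_for_execution(get_state, poll_seconds=10, timeout_seconds=1800, sleep=time.sleep):
    """Poll until an execution reaches a terminal state.

    get_state is a zero-arg callable returning the current state string,
    e.g. a closure over a GET /api/v1/executions/{id} call. Injecting it
    (and sleep) keeps the loop testable without a live server.
    """
    waited = 0
    state = get_state()
    while state not in TERMINAL_STATES:
        if waited >= timeout_seconds:
            raise TimeoutError(f"execution still {state!r} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
        state = get_state()
    return state
```

The dependency injection is what makes the tool itself testable with the same discipline we demand of the flows.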

Inspecting (the database)

| Tool | What it does |
| --- | --- |
| `db_query` | Read-only SQL. Auto-appends `LIMIT 50`. Returns a formatted table. |
| `db_schema` | List tables or describe one table's columns/types. |
| `db_channel` | Look up a channel by UUID, handle (`@Name`), or YouTube channel ID. |
| `db_artifacts` | Count all pipeline artifacts for a channel. |
| `db_execution` | Status, stage progress, and artifact counts for an execution. |
| `db_count` / `db_get` / `db_find` | Fast typed queries for common shapes. |
| `db_reset_channel` | Delete all pipeline artifacts for a channel (dry-run by default). |

This is the half of the loop that's usually missing. A flow's logs tell you "task X succeeded." The DB tells you whether task X did the right thing. With db_query, the agent can ask "did this run actually write a design_variants row with non-null s3_key?" instead of guessing from a green checkmark.
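That kind of check is plain SQL. A toy reproduction, with SQLite standing in for Postgres and table/column names taken from this post's examples (the real `db_query` also auto-appends `LIMIT 50`, omitted here since this is an aggregate):

```python
import sqlite3

# In-memory stand-in for the pipeline DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE design_variants (id INTEGER, channel_uuid TEXT, s3_key TEXT)")
conn.executemany(
    "INSERT INTO design_variants VALUES (?, ?, ?)",
    [(1, "chan-1", "designs/1.png"), (2, "chan-1", None)],  # one good row, one NULL bug
)

# "Did this run actually write design_variants rows with a non-null s3_key?"
row = conn.execute(
    "SELECT COUNT(*) FROM design_variants "
    "WHERE channel_uuid = ? AND s3_key IS NOT NULL",
    ("chan-1",),
).fetchone()
print(row[0])  # 1 — the NULL row is the bug the green checkmark hid
```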

db_reset_channel deserves a comment: it defaults to dry_run=true and prints what it would delete. We added the dry-run default after the model deleted real channel data while trying to "clean up state." Defaults matter when an LLM is the caller.

Codebase contracts

The MCP closes the loop. The codebase has to make the loop useful. Five contracts make agentic authoring tractable.

1. Thin inline Python in flows

Business logic belongs in Python modules. Flow YAML contains input declarations, task wiring, env injection, and thin wrapper scripts (<15 lines) that import a module function and call emit_outputs.

# GOOD
script: |
  import sys, os
  sys.path.insert(0, ".")
  from merchsage.listing.seo import enrich_design_seo
  from merchsage.kestra import emit_outputs

  result = enrich_design_seo(
      design_id=os.environ["DESIGN_ID"],
      channel_uuid=os.environ["CHANNEL_UUID"],
  )
  emit_outputs(result)

Two hundred lines of business logic in a YAML string are unreadable, untestable, and un-fixable for human and agent alike. We resist that pattern on principle: logic lives in modules, flows stay thin.

2. pluginDefaults injects credentials

Every Python task gets credentials and EXECUTION_ID automatically through global pluginDefaults. Flow tasks don't declare env: for GEMINI_TOKEN, DB_HOST, AWS_*, etc. They only declare env: for dynamic, task-specific values like CHANNEL_UUID.

The agent doesn't have to remember which env vars to plumb through each task. It writes os.environ["GEMINI_TOKEN"] and it works. That removes a whole class of "I forgot to map this" bugs.
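The shape of such a global pluginDefaults block, sketched (the task type is Kestra's Python script plugin; the specific secret and KV names are illustrative, not our actual config):

```yaml
pluginDefaults:
  - type: io.kestra.plugin.scripts.python.Script
    values:
      env:
        GEMINI_TOKEN: "{{ secret('GEMINI_TOKEN') }}"
        DB_HOST: "{{ kv('DB_HOST') }}"
        EXECUTION_ID: "{{ execution.id }}"
```

Every Python script task inherits these values; flow YAML only adds the per-task dynamic env.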

3. Fail fast over fallbacks

This one is a behavioral contract more than an architectural one. If you let an LLM write code with no constraints, it will add try/except around every API call and fall back to defaults. Six months later, you'll have a pipeline that appears to work and silently produces wrong artifacts.

The codebase rule is: no synthetic data, no fallbacks, no backward-compatibility shims. Missing prompt? Crash. Missing required field? Crash. Mismatched fields? Skip with a warning. Required artifact missing? sys.exit(1).

# BAD
prompt = get_prompt("my_prompt") or "Some hardcoded fallback"

# GOOD
prompt = get_prompt("my_prompt")  # raises PromptLoadError

A failed pipeline run is cheap. A silently-degraded run is not. This rule is how we keep agent-authored code debuggable.

4. os.environ["KEY"], not .get(default)

Same principle, narrower instance. Flow YAML always provides declared env vars at runtime, so a Python-side default is dead code that masks missing configuration.

# BAD — hardcoded default masks missing config
region = os.environ.get("S3_REGION", "eu-west-1")

# GOOD — KeyError if missing, fixed in seconds
region = os.environ["S3_REGION"]

A KeyError with a clear name is a one-line fix. A wrong default that produces wrong artifacts is a week-long mystery.

5. DB-first stage handoff

Stages don't pass outputs to one another through Kestra. They write to the DB and exit. The next stage loads what it needs by channel_uuid.

The big win is iteration. Once a stage has produced its output, you can re-run any downstream stage against that output as many times as you like — with different params, different prompts, or different code — without paying to regenerate the upstream work. A set of concepts can drive a dozen design experiments. A set of designs can produce mockups across several product configurations. The expensive upstream work is amortized across many cheap downstream variations. This is more than Kestra's task replay, which restarts a failed task in place — it's iterating on stage logic against a fixed upstream set.

Debuggability comes along for the ride. An agent investigating "why did Stage 5 produce no mockups?" can read the Stage 4 outputs from design_variants directly, without replaying Stage 4 or parsing an outputs JSON blob in a Kestra log. The DB is the audit trail.

What this unlocks

When the loop is closed and the contracts hold, the agent goes from writing code to operating the system. Concretely:

  • It writes a new flow, syncs it, dispatches a micro-run, reads kestra_logs, finds a KeyError on a missing env, fixes it, re-syncs, re-runs.
  • It investigates "why did this channel only get 1 design instead of 8?" by db_query-ing design_variants joined to product_concepts, finding the rating threshold rejected 7 of them, and adjusting the config preset.
  • It diagnoses a stuck phase_creative execution by reading kestra_status, killing it, calling db_reset_channel with dry_run=true first to confirm what it'll delete, then running it for real.

None of those steps require the agent to ask "what should I do?" The information it needs is directly accessible by tool call.

The point

Two things make this work, and they're co-dependent. The MCP closes the operating loop — author, run, observe, inspect, reset. The codebase contracts make the signals from that loop trustworthy — fail loudly, never fall back, never silently degrade.

The same setup now runs the merch pipeline, the ops automation, and the marketing flows — with the humans focused on product and architecture, not the code.