<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>MerchSage Engineering Blog</title>
        <link>https://merchsage.com/blog</link>
        <description>AI-driven print-on-demand, Kestra orchestration, and LLM-collaborative engineering.</description>
        <lastBuildDate>Thu, 07 May 2026 18:18:33 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <image>
            <title>MerchSage Engineering Blog</title>
            <url>https://merchsage.com/og-image.png</url>
            <link>https://merchsage.com/blog</link>
        </image>
        <copyright>© 2026 PRISMATIKORBIT LDA</copyright>
        <item>
            <title><![CDATA[How we get clean design cutouts from a generative model]]></title>
            <link>https://merchsage.com/blog/clean-cutouts-from-generative-models</link>
            <guid>https://merchsage.com/blog/clean-cutouts-from-generative-models</guid>
            <pubDate>Sun, 10 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Generating against five maximally-contrasting blurred natural scenes, picking the most distant in HSL space, then stripping with Photoroom — and the three approaches that didn't work.]]></description>
            <content:encoded><![CDATA[<p>A print-on-demand pipeline needs <em>transparent</em> artwork. A t-shirt design has to live on white cotton, black cotton, and a hundred shades in between. The artwork itself has to ship as a transparent PNG with a clean alpha channel — no halo, no fringe, no ghost of the background bleeding into the design&#39;s edges.</p>
<p>The hard part is that generative image models don&#39;t produce transparent PNGs. They produce JPEG-grade RGB on a background. Whatever you do to remove the background later has to fight whatever the model decided to put there. After a lot of failed experiments, we landed on a technique that works for ~95% of designs on the first attempt. This post covers what that technique is, and what we tried first.</p>
<h2 id="the-goal">The goal</h2>
<p>The output is a transparent PNG of the artwork, edges crisp, alpha properly anti-aliased at the boundary. The artwork lands on a Printful product, gets composited onto a mockup, and then onto a real garment. Any fringe, smear, or color contamination from the original background is visible in production.</p>
<p>We need this to be reliable, automated, and run on every concept the pipeline generates. There is no human in the loop reviewing edge quality.</p>
<h2 id="what-didnt-work-1-asking-the-model-for-transparency">What didn&#39;t work #1: asking the model for transparency</h2>
<p>Ask for &quot;transparent background&quot; or &quot;PNG with alpha channel&quot; and you get a <em>checkerboard</em> — the gray-and-white pattern design tools use to depict transparency. The model has seen this all over its training data and produces the visual metaphor. Background removers can strip it, but the result is ragged, with gray bleeding into the artwork&#39;s edges.</p>
<h2 id="what-didnt-work-2-flat-solid-backgrounds">What didn&#39;t work #2: flat solid backgrounds</h2>
<p>Next attempt: instruct the model to generate against a single solid color, picked to maximally contrast with the artwork. Use a chroma key remover.</p>
<p>Two failure modes:</p>
<ol>
<li><strong>The model doesn&#39;t actually produce flat solid color.</strong> It produces something <em>close</em>, with subtle texture, gradients, and lighting from the artwork bleeding into the background. Chroma keying treats anything within a tolerance band as &quot;background&quot; and anything outside as &quot;foreground&quot; — but with subtle texture, the tolerance has to be wide, and now you&#39;re keying out parts of the artwork that share a hue.</li>
<li><strong>Color contamination at the boundary.</strong> The model treats the background as part of the image. Light from the artwork reflects onto the background. The boundary pixels are a blend of foreground and background colors. When you key out the background, the boundary pixels get the wrong alpha — and you can <em>see</em> the contaminating color in the cutout.</li>
</ol>
<p>You can paper over this with edge-aware mattes and Photoshop tricks, but at scale it&#39;s brittle.</p>
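<p>For concreteness, the naive version we were fighting looks roughly like this — a minimal chroma-key sketch with a single tolerance band, not our production code; the helper name and parameters are illustrative.</p>
<pre><code class="language-python"># Illustrative only: naive chroma keying with one tolerance band,
# the approach described above — not MerchSage production code.
import numpy as np
from PIL import Image

def naive_chroma_key(img, key_rgb, tolerance):
    &quot;&quot;&quot;Zero out the alpha of every pixel within `tolerance` of the key color.&quot;&quot;&quot;
    rgba = np.array(img.convert(&quot;RGBA&quot;), dtype=np.float32)
    # Per-pixel Euclidean distance from the key color.
    dist = np.linalg.norm(rgba[..., :3] - np.array(key_rgb, dtype=np.float32), axis=-1)
    rgba[dist &lt; tolerance, 3] = 0  # hard cut — no soft matte at the boundary
    return Image.fromarray(rgba.astype(np.uint8), mode=&quot;RGBA&quot;)
</code></pre>
<p>With a &quot;solid&quot; background that is actually textured, <code>tolerance</code> has to grow until it starts swallowing artwork pixels that share the key&#39;s hue — exactly failure mode #1 above.</p>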
<h2 id="what-didnt-work-3-distinctive-patterns">What didn&#39;t work #3: distinctive patterns</h2>
<p>If the matte struggles with subtle texture, give it something unmistakable. We tried concentric rings and checkerboards, in black-and-white and in colors picked to contrast with the artwork.</p>
<p>Same failure mode as #2, in some ways worse. A pattern shares too much <em>essence</em> with the artwork — high-frequency, graphical, edge-heavy. Background removers separate something graphic from something that isn&#39;t. When the background is itself a deliberate graphic, that distinction collapses, and the matte keys arbitrarily on whichever pattern element happens to be near the boundary. Edges came out worse than with flat color, not better.</p>
<p>The signal here pointed at the answer: the background needed to look fundamentally <em>unlike</em> the artwork — different in style, different in spatial frequency.</p>
<h2 id="the-technique-that-worked">The technique that worked</h2>
<p>Generative models are great at producing rich, naturalistic scenes. They&#39;re bad at producing flat color. So: ask for a rich scene that maximally contrasts with the artwork&#39;s palette, then use a real background remover that handles natural imagery well.</p>
<p>The pipeline:</p>
<ol>
<li><strong>Curate a small palette of high-contrast natural scenes.</strong> Five is enough.</li>
<li><strong>Pick the scene whose representative color is maximally distant from the artwork&#39;s color palette in HSL space.</strong></li>
<li><strong>Generate the artwork over that scene as the background.</strong></li>
<li><strong>Strip the background via <a href="https://photoroom.com">Photoroom</a>.</strong></li>
</ol>
<p>The scenes are deliberately chosen to span the hue wheel. Whatever palette the artwork has, at least one scene will sit far from it.</p>
<h2 id="the-five-scenes">The five scenes</h2>
<p>This is <code>SCENE_PALETTE</code> from <code>packages/python/merchsage/merchsage/concepts/backgrounds.py</code>:</p>
<table>
<thead>
<tr>
<th>Hex</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><code>#2D5A27</code></td>
<td>Oblique view of a dense coniferous forest patch on a hillside — no sky, no clouds — heavily blurred/defocused with soft, diffused natural daylight</td>
</tr>
<tr>
<td><code>#C8A23D</code></td>
<td>Close-up of a dry wheat field at golden hour — no sky, no horizon — heavily blurred/defocused with warm, diffused amber sunlight filtering through the stalks</td>
</tr>
<tr>
<td><code>#3A6B7C</code></td>
<td>Close-up of smooth river stones submerged in shallow clear water — no sky, no surface reflections — heavily blurred/defocused with cool, diffused overcast daylight</td>
</tr>
<tr>
<td><code>#A0522D</code></td>
<td>Close-up of layered sandstone rock face with natural iron-oxide striations — no sky, no vegetation — heavily blurred/defocused with soft, diffused warm daylight</td>
</tr>
<tr>
<td><code>#7B5EA7</code></td>
<td>Close-up of a dense lavender field in full bloom — no sky, no paths — heavily blurred/defocused with soft, diffused cool daylight casting gentle violet shadows</td>
</tr>
</tbody></table>
<p>A few non-obvious choices:</p>
<ul>
<li><strong>Heavily blurred / defocused.</strong> This is critical. A sharp natural scene gives the background remover too many edges to confuse with the artwork&#39;s edges. A defocused scene reads as a soft color field with low spatial frequency — easy to subtract.</li>
<li><strong>No sky, no horizon, no clouds.</strong> Skies are bright and uniform; they create artificial flat regions where chroma keying behavior re-emerges. The instructions explicitly forbid them.</li>
<li><strong>Diffused light.</strong> Direct sunlight produces hot specular highlights that compete with the artwork. Diffused light gives uniform exposure across the frame.</li>
<li><strong>Hue spread.</strong> Forest green, wheat gold, river blue, sandstone red-brown, lavender violet. Five hues, ~72° apart on the wheel. Whatever the artwork is, at least one scene will be in opposition.</li>
</ul>
<h2 id="greedy-max-min-selection-in-hsl-space">Greedy max-min selection in HSL space</h2>
<p>For a given concept, we want the scene whose representative color is <em>furthest from any color in the artwork palette</em>.</p>
<pre><code class="language-python">def _hsl_distance(hsl1, hsl2):
    &quot;&quot;&quot;Squared distance between two HSL tuples (wrapping hue).&quot;&quot;&quot;
    dh = min(abs(hsl1[0] - hsl2[0]), 1 - abs(hsl1[0] - hsl2[0]))
    ds = hsl1[1] - hsl2[1]
    dl = hsl1[2] - hsl2[2]
    return dh**2 + ds**2 + dl**2
</code></pre>
<p>The hue distance wraps at 1.0 — red and magenta are <em>close</em>, even though their numeric hue values are far apart. The selection is a simple greedy max-min:</p>
<blockquote>
<p>For each scene, compute the <em>minimum</em> HSL distance to any artwork color. Pick the scene with the <em>maximum</em> of those minima.</p>
</blockquote>
<p>In other words: pick the scene that is far from the artwork&#39;s <em>closest</em> color, not its average. This is the right objective because the failure mode of background removal is &quot;this background pixel got confused with that artwork pixel&quot; — what matters is the worst pair, not the typical pair.</p>
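<p>A minimal sketch of that selection, assuming both palettes are lists of normalized <code>(h, s, l)</code> tuples; the function name and signature are illustrative rather than the exact production code.</p>
<pre><code class="language-python">def pick_scene(scene_hsls, artwork_hsls):
    &quot;&quot;&quot;Greedy max-min: the scene furthest from the closest artwork color.&quot;&quot;&quot;
    best_idx, best_score = 0, -1.0
    for i, scene in enumerate(scene_hsls):
        # Distance to the *nearest* artwork color — the worst-case pair for matting.
        nearest = min(_hsl_distance(scene, color) for color in artwork_hsls)
        if nearest &gt; best_score:
            best_idx, best_score = i, nearest
    return best_idx
</code></pre>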
<p>HSL is preferable here to RGB or Lab. Hue captures the perceptual axis humans (and Photoroom&#39;s matting model) actually disambiguate on; saturation and lightness are secondary signals — they don&#39;t dominate.</p>
<h2 id="stripping-with-photoroom">Stripping with Photoroom</h2>
<p>Once the artwork is generated against the selected scene, <a href="https://photoroom.com">Photoroom</a>&#39;s API does the actual matting. Two reasons we picked it:</p>
<ol>
<li>It produces clean alpha at edges, including hair-like fine detail. Most chroma keyers don&#39;t.</li>
<li>It handles natural imagery well. The scene is full-frame nature; Photoroom doesn&#39;t get confused by it because nature is what it&#39;s trained on.</li>
</ol>
<p>The remaining pipeline is mundane: alpha threshold to clean up sub-1% alpha noise, crop to the artwork&#39;s bounding box, save as PNG. Up to 20 concurrent Photoroom calls per pipeline run.</p>
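<p>That cleanup is small enough to sketch with Pillow — the threshold and names below are illustrative, not the exact production code.</p>
<pre><code class="language-python">from PIL import Image

def clean_cutout(path_in, path_out, alpha_floor=3):
    &quot;&quot;&quot;Threshold stray alpha, crop to the artwork, save as PNG.&quot;&quot;&quot;
    img = Image.open(path_in).convert(&quot;RGBA&quot;)
    alpha = img.getchannel(&quot;A&quot;)
    # Kill sub-1% alpha noise (3/255 is roughly 1%) so ghost pixels cannot inflate the bbox.
    alpha = alpha.point(lambda a: 0 if a &lt; alpha_floor else a)
    img.putalpha(alpha)
    # Crop to the bounding box of the cleaned alpha channel.
    bbox = alpha.getbbox()
    if bbox is not None:
        img = img.crop(bbox)
    img.save(path_out, format=&quot;PNG&quot;)
</code></pre>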
<h2 id="catching-the-misses">Catching the misses</h2>
<p>Photoroom gets us most of the way, but it isn&#39;t perfect. A scene texture occasionally bleeds into a thin negative-space region. The model sometimes paints the artwork with edges that share a hue with the scene, and the matte cuts in too far. A halo of warm pixels can ring an element after the cutout. We can&#39;t ship those.</p>
<p>The fan rater downstream is the cleanup pass. It gets <em>two</em> images per design:</p>
<ul>
<li><strong>Image A</strong>: the original generation, scene background still present.</li>
<li><strong>Image B</strong>: the artwork after the scene has been stripped.</li>
</ul>
<p>Its job is to compare them and flag structural artifacts in B that came from A&#39;s scene. The vocabulary is specific: a <em>remnant</em> is a blob, halo, smear, or patch — something with shape and location that isn&#39;t part of the intended design. A ring of grass-green pixels around a coffee cup is a remnant. A semi-transparent smudge of sky in empty space is a remnant.</p>
<p>What <em>isn&#39;t</em> a remnant matters as much, because without these exclusions the rater over-flags:</p>
<ul>
<li><strong>Scene-lighting tints baked into design colors.</strong> Generating against a wheat field warms the foreground hues; against river stones it cools them. Those tints persist into the cutout. They look like contamination but aren&#39;t — they&#39;re the model&#39;s rendering, and they&#39;re in every design.</li>
<li><strong>Soft anti-aliased edges.</strong> A 1–2px alpha gradient is how a clean cutout <em>should</em> look. Penalizing it produces crunchy, aliased designs.</li>
<li><strong>Intentional elements the artwork description names</strong> — stars, dots, glow rings. Without this clause the rater flags legitimate stylistic flourishes as scene bleed.</li>
</ul>
<p>Designs the rater flags as having visible remnants get thrown away.</p>
<h2 id="results">Results</h2>
<p>Roughly 95% of designs come out of Photoroom clean on the first pass. The rater catches most of what doesn&#39;t. By the time designs reach production, visible cutout artifacts show up in about 1 design in 300.</p>
<h2 id="the-takeaway">The takeaway</h2>
<p>The generative model does what it&#39;s good at: producing a rich naturalistic scene. The classical CV pipeline (color-distance scene selection + Photoroom matting) does what <em>it&#39;s</em> good at: separating foreground from a non-degenerate background.</p>
<p>Most of the time, when a generative pipeline gives bad output, the answer isn&#39;t a better prompt or a better model. It&#39;s recognizing which step in the pipeline is asking the model to do something it&#39;s bad at, and replacing that step with a deterministic one.</p>
]]></content:encoded>
            <category>image-generation</category>
            <category>design</category>
            <category>gemini</category>
            <category>photoroom</category>
            <category>color-theory</category>
        </item>
        <item>
            <title><![CDATA[Agentic Kestra: making an LLM a first-class flow author]]></title>
            <link>https://merchsage.com/blog/agentic-kestra</link>
            <guid>https://merchsage.com/blog/agentic-kestra</guid>
            <pubDate>Sat, 09 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[An MCP that closes the author → sync → test → observe → DB-inspect → reset loop, plus the codebase contracts that make LLM-driven flow authoring actually work.]]></description>
            <content:encoded><![CDATA[<p>MerchSage runs on <a href="https://kestra.io">Kestra</a>. The merch pipeline itself is around 30 flows. We liked the architecture enough that we extended it to the rest of the business — ops and marketing run as Kestra flows too — bringing the total close to 60, all backed by a single Postgres database that every stage reads from and writes to. Every line of code in the system was written by Claude — not a single human line. The humans direct product and architecture, with Claude&#39;s help on both.</p>
<p>This post is about what changes when you take that seriously. Specifically: what tools an LLM needs to be a real Kestra collaborator, and what code-level contracts you have to enforce so it can debug its own work.</p>
<h2 id="the-problem">The problem</h2>
<p>If you let an LLM write a Kestra flow with only a filesystem and a shell, here&#39;s what you&#39;ll observe.</p>
<p>It writes the flow. It tries to &quot;test&quot; by reading the YAML back. It cannot dispatch a run. If it could dispatch, it cannot follow logs. If it could follow logs, it cannot inspect the rows that the flow&#39;s Python tasks wrote. If it cannot inspect rows, it cannot debug the data — only the syntax. It will then over-correct, add layers of defensive code, and bury the actual bug.</p>
<p>The fix isn&#39;t a smarter model. The fix is closing the loop. The author needs to be able to <em>operate</em> what they author.</p>
<h2 id="the-mcp">The MCP</h2>
<p>We ship <code>@merchsage/mcp-kestra</code>, an MCP server that exposes the operations needed to close the loop. The tools fall into four buckets.</p>
<h3 id="authoring">Authoring</h3>
<table>
<thead>
<tr>
<th>Tool</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>kestra_sync</code></td>
<td>Sync flows or namespace files to the Kestra server. Targets: <code>flow</code>, <code>namespace</code>, <code>all</code>.</td>
</tr>
</tbody></table>
<p>Authoring is the easy part. The hard part is what comes after.</p>
<h3 id="operating">Operating</h3>
<table>
<thead>
<tr>
<th>Tool</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>kestra_run</code></td>
<td>Dispatch any flow with inputs. Optional <code>wait=True</code> polls every 10s.</td>
</tr>
<tr>
<td><code>kestra_pipeline_test</code></td>
<td>Cheap-test wrapper for the main pipeline. Selects stages, picks a small/fast channel, runs against the <code>onboarding_micro</code> config preset.</td>
</tr>
<tr>
<td><code>kestra_status</code></td>
<td>Status, stage progress, and artifact counts for an execution.</td>
</tr>
<tr>
<td><code>kestra_logs</code></td>
<td>Logs for an execution, filterable by level.</td>
</tr>
<tr>
<td><code>kestra_list</code></td>
<td>Recent executions for a flow.</td>
</tr>
</tbody></table>
<p><code>kestra_pipeline_test</code> is the one we&#39;d argue for hardest. Without a cheap-test wrapper, the model defaults to either &quot;I think it works&quot; or running a full pipeline that costs real money. With one, it dispatches a one-design micro-run, waits ~12 minutes, and reads the artifact counts. That&#39;s the development inner loop.</p>
<h3 id="inspecting-the-database">Inspecting (the database)</h3>
<table>
<thead>
<tr>
<th>Tool</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>db_query</code></td>
<td>Read-only SQL. Auto-appends <code>LIMIT 50</code>. Returns formatted table.</td>
</tr>
<tr>
<td><code>db_schema</code></td>
<td>List tables or describe one table&#39;s columns/types.</td>
</tr>
<tr>
<td><code>db_channel</code></td>
<td>Look up a channel by UUID, handle (<code>@Name</code>), or YouTube channel ID.</td>
</tr>
<tr>
<td><code>db_artifacts</code></td>
<td>Count all pipeline artifacts for a channel.</td>
</tr>
<tr>
<td><code>db_execution</code></td>
<td>Status, stage progress, and artifact counts for an execution.</td>
</tr>
<tr>
<td><code>db_count</code> / <code>db_get</code> / <code>db_find</code></td>
<td>Fast typed queries for common shapes.</td>
</tr>
<tr>
<td><code>db_reset_channel</code></td>
<td>Delete all pipeline artifacts for a channel (dry-run by default).</td>
</tr>
</tbody></table>
<p>This is the half of the loop that&#39;s usually missing. A flow&#39;s logs tell you &quot;task X succeeded.&quot; The DB tells you whether task X <em>did the right thing</em>. With <code>db_query</code>, the agent can ask &quot;did this run actually write a <code>design_variants</code> row with non-null s3_key?&quot; instead of guessing from a green checkmark.</p>
<p><code>db_reset_channel</code> deserves a comment: it defaults to <code>dry_run=true</code> and prints what it would delete. We added the dry-run default after the model removed real channel data trying to &quot;clean up state.&quot; Defaults matter when an LLM is the one calling.</p>
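<p>The shape of that default, as a Python sketch — not the actual <code>@merchsage/mcp-kestra</code> implementation, and the table list here is an illustrative subset.</p>
<pre><code class="language-python">def db_reset_channel(conn, channel_uuid, dry_run=True):
    &quot;&quot;&quot;Count (and, only when dry_run is False, delete) pipeline artifacts for a channel.&quot;&quot;&quot;
    tables = [&quot;design_variants&quot;, &quot;product_concepts&quot;]  # illustrative subset
    counts = {}
    with conn.cursor() as cur:
        for table in tables:
            cur.execute(f&quot;SELECT count(*) FROM {table} WHERE channel_uuid = %s&quot;, (channel_uuid,))
            counts[table] = cur.fetchone()[0]
            if not dry_run:
                cur.execute(f&quot;DELETE FROM {table} WHERE channel_uuid = %s&quot;, (channel_uuid,))
    if not dry_run:
        conn.commit()
    # The caller reports the same counts either way: &quot;would delete&quot; vs &quot;deleted&quot;.
    return counts
</code></pre>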
<h2 id="codebase-contracts">Codebase contracts</h2>
<p>The MCP closes the loop. The codebase has to make the loop <em>useful</em>. Five contracts make agentic authoring tractable.</p>
<h3 id="1-thin-inline-python-in-flows">1. Thin inline Python in flows</h3>
<p>Business logic belongs in Python modules. Flow YAML contains input declarations, task wiring, env injection, and thin wrapper scripts (&lt;15 lines) that import a module function and call <code>emit_outputs</code>.</p>
<pre><code class="language-yaml"># GOOD
script: |
  import sys, os
  sys.path.insert(0, &quot;.&quot;)
  from merchsage.listing.seo import enrich_design_seo
  from merchsage.kestra import emit_outputs

  result = enrich_design_seo(
      design_id=os.environ[&quot;DESIGN_ID&quot;],
      channel_uuid=os.environ[&quot;CHANNEL_UUID&quot;],
  )
  emit_outputs(result)
</code></pre>
<p>200 lines of business logic in a YAML string is unreadable, untestable, and un-fixable for both a human and an agent. We mechanically resist that pattern.</p>
<h3 id="2-plugindefaults-injects-credentials">2. <code>pluginDefaults</code> injects credentials</h3>
<p>Every Python task gets credentials and <code>EXECUTION_ID</code> automatically through global <code>pluginDefaults</code>. Flow tasks don&#39;t declare <code>env:</code> for <code>GEMINI_TOKEN</code>, <code>DB_HOST</code>, <code>AWS_*</code>, etc. They only declare <code>env:</code> for dynamic, task-specific values like <code>CHANNEL_UUID</code>.</p>
<p>The agent doesn&#39;t have to remember which env vars to plumb. It writes <code>os.environ[&quot;GEMINI_TOKEN&quot;]</code> and it works. Reduces a whole class of &quot;I forgot to map this&quot; bugs.</p>
<h3 id="3-fail-fast-over-fallbacks">3. Fail fast over fallbacks</h3>
<p>This one is a behavioral contract more than an architectural one. If you let an LLM write code with no constraints, it will add try/except around every API call and fall back to defaults. Six months later, you&#39;ll have a pipeline that <em>appears</em> to work and silently produces wrong artifacts.</p>
<p>The codebase rule is: no synthetic data, no fallbacks, no backward-compatibility shims. Missing prompt? Crash. Missing required field? Crash. Mismatched fields? Skip with a warning. Required artifact missing? <code>sys.exit(1)</code>.</p>
<pre><code class="language-python"># BAD
prompt = get_prompt(&quot;my_prompt&quot;) or &quot;Some hardcoded fallback&quot;

# GOOD
prompt = get_prompt(&quot;my_prompt&quot;)  # raises PromptLoadError
</code></pre>
<p>A failed pipeline run is cheap. A silently-degraded run is not. This rule is how we keep agent-authored code debuggable.</p>
<h3 id="4-osenvironkey-not-getdefault">4. <code>os.environ[&quot;KEY&quot;]</code>, not <code>.get(default)</code></h3>
<p>Same principle, narrower instance. Flow YAML always provides declared env vars at runtime, so a Python-side default is dead code that masks missing configuration.</p>
<pre><code class="language-python"># BAD — hardcoded default masks missing config
region = os.environ.get(&quot;S3_REGION&quot;, &quot;eu-west-1&quot;)

# GOOD — KeyError if missing, fixed in seconds
region = os.environ[&quot;S3_REGION&quot;]
</code></pre>
<p>A <code>KeyError</code> with a clear name is a one-line fix. A wrong default that produces wrong artifacts is a week-long mystery.</p>
<h3 id="5-db-first-stage-handoff">5. DB-first stage handoff</h3>
<p>Stages don&#39;t pass outputs to one another through Kestra. They write to the DB and exit. The next stage loads what it needs by <code>channel_uuid</code>.</p>
<p>The big win is iteration. Once a stage has produced its output, you can re-run any downstream stage against that output as many times as you like — with different params, different prompts, or different code — without paying to regenerate the upstream work. A set of concepts can drive a dozen design experiments. A set of designs can produce mockups across several product configurations. The expensive upstream work is amortized across many cheap downstream variations. This is more than Kestra&#39;s task replay, which restarts a failed task in place — it&#39;s iterating on stage <em>logic</em> against a fixed upstream set.</p>
<p>Debuggability comes along for the ride. An agent investigating &quot;why did Stage 5 produce no mockups?&quot; can read the Stage 4 outputs from <code>design_variants</code> directly, without replaying Stage 4 or parsing an <code>outputs</code> JSON blob in a Kestra log. The DB is the audit trail.</p>
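<p>The loading side of that handoff looks roughly like the sketch below — <code>design_variants</code>, <code>channel_uuid</code>, and <code>s3_key</code> appear elsewhere in this post; the query shape and helper name are illustrative.</p>
<pre><code class="language-python">def load_designs_for_mockups(conn, channel_uuid):
    &quot;&quot;&quot;Stage 5 reads Stage 4 output straight from the DB — no Kestra outputs involved.&quot;&quot;&quot;
    with conn.cursor() as cur:
        cur.execute(
            &quot;SELECT id, s3_key FROM design_variants &quot;
            &quot;WHERE channel_uuid = %s AND s3_key IS NOT NULL&quot;,
            (channel_uuid,),
        )
        return [{&quot;id&quot;: row[0], &quot;s3_key&quot;: row[1]} for row in cur.fetchall()]
</code></pre>
<p>Re-running Stage 5 with different products or prompts is just calling this again against the same rows — the Stage 4 work is never regenerated.</p>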
<h2 id="what-this-unlocks">What this unlocks</h2>
<p>When the loop is closed and the contracts hold, the agent goes from writing code to operating the system. Concretely:</p>
<ul>
<li>It writes a new flow, syncs it, dispatches a micro-run, reads <code>kestra_logs</code>, finds a <code>KeyError</code> on a missing env, fixes it, re-syncs, re-runs.</li>
<li>It investigates &quot;why did this channel only get 1 design instead of 8?&quot; by <code>db_query</code>-ing <code>design_variants</code> joined to <code>product_concepts</code>, finding the rating threshold rejected 7 of them, and adjusting the config preset.</li>
<li>It diagnoses a stuck <code>phase_creative</code> execution by reading <code>kestra_status</code>, killing it, calling <code>db_reset_channel</code> with <code>dry_run=true</code> first to confirm what it&#39;ll delete, then running it for real.</li>
</ul>
<p>None of those steps require the agent to ask &quot;what should I do?&quot; The information it needs is directly accessible by tool call.</p>
<h2 id="the-point">The point</h2>
<p>Two things make this work, and they&#39;re co-dependent. The MCP closes the operating loop — author, run, observe, inspect, reset. The codebase contracts make the signals from that loop trustworthy — fail loudly, never fall back, never silently degrade.</p>
<p>The same setup now runs the merch pipeline, the ops automation, and the marketing flows — with the humans focused on product and architecture, not the code.</p>
]]></content:encoded>
            <category>kestra</category>
            <category>llm</category>
            <category>mcp</category>
            <category>agents</category>
            <category>claude-code</category>
        </item>
        <item>
            <title><![CDATA[How MerchSage turns a YouTube channel into print-on-demand merch in 6 stages]]></title>
            <link>https://merchsage.com/blog/youtube-to-merch-pipeline</link>
            <guid>https://merchsage.com/blog/youtube-to-merch-pipeline</guid>
            <pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[An overview of MerchSage's autonomous pipeline: Scrape → Analyze → Concepts → Designs → Mockups → Listings. What each stage contributes and how they fit together.]]></description>
            <content:encoded><![CDATA[<p>MerchSage takes a YouTube channel URL and produces a stocked storefront. No human picks the products. No human writes artwork prompts. No human approves a design before it lands on a t-shirt. The whole thing runs as a 6-stage pipeline.</p>
<p>This post is the overview — what each stage contributes and how they fit together. Two of the stages are interesting enough to deserve their own posts, linked below.</p>
<h2 id="the-6-stages">The 6 stages</h2>
<pre><code>Scrape → Analyze → Generate Concepts → Create Designs → Generate Mockups → Finalize Listings
</code></pre>
<table>
<thead>
<tr>
<th>Stage</th>
<th>Role</th>
</tr>
</thead>
<tbody><tr>
<td>1. Scrape</td>
<td>Pull raw material from YouTube</td>
</tr>
<tr>
<td>2. Analyze</td>
<td>Build a structured creative brief for the channel</td>
</tr>
<tr>
<td>3. Concepts</td>
<td>Turn the brief into design briefs</td>
</tr>
<tr>
<td>4. Designs</td>
<td>Render the briefs as artwork</td>
</tr>
<tr>
<td>5. Mockups</td>
<td>Show the artwork on real products</td>
</tr>
<tr>
<td>6. Listings</td>
<td>Produce storefront-ready listing drafts</td>
</tr>
</tbody></table>
<p>Each stage has one job. Each stage can be re-run on its own.</p>
<h2 id="stage-1--scrape">Stage 1 — Scrape</h2>
<p>We pull the raw evidence: a representative sample of the channel&#39;s videos, their transcripts, comments, and visual material. The sample is time-spread and outlier-trimmed, so a single viral hit doesn&#39;t drag the brand reading off-center. This is the only stage that reaches outside the system — everything downstream works from what we capture here.</p>
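<p>As a rough illustration of that sampling — not the production scraper; the field names and defaults here are assumptions:</p>
<pre><code class="language-python">def sample_videos(videos, k=20, trim_pct=0.1):
    &quot;&quot;&quot;Trim view-count outliers, then spread the sample evenly across the timeline.&quot;&quot;&quot;
    # Drop the top slice by views so one viral hit cannot dominate the brand reading.
    by_views = sorted(videos, key=lambda v: v[&quot;view_count&quot;])
    cut = int(len(by_views) * trim_pct)
    trimmed = by_views[: len(by_views) - cut] if cut else by_views
    # Time-spread: sort by publish date and take evenly spaced picks.
    by_date = sorted(trimmed, key=lambda v: v[&quot;published_at&quot;])
    if len(by_date) &lt;= k:
        return by_date
    step = len(by_date) / k
    return [by_date[int(i * step)] for i in range(k)]
</code></pre>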
<h2 id="stage-2--analyze">Stage 2 — Analyze</h2>
<p>A set of AI specialists read the scraped material and build a structured understanding of the channel. Between them, they produce:</p>
<ul>
<li><strong>Brand understanding</strong> — what the creator stands for, who the audience is, the personality of the channel.</li>
<li><strong>A visual design guide</strong> — palette, motifs, typography, the creative range that fits the brand.</li>
<li><strong>A product plan</strong> — which products belong in this creator&#39;s lineup, and what mockup scenes match the channel&#39;s vibe.</li>
<li><strong>Asset extraction</strong> — recurring visual elements (logos, faces, characters) lifted from channel imagery for reuse in designs.</li>
</ul>
<p>Every downstream creative decision flows from this. Anything we generate later sits on top of the design guide and the product plan.</p>
<h2 id="stage-3--generate-concepts">Stage 3 — Generate Concepts</h2>
<p>Turn the creative brief into design briefs — the specifications that drive image generation.</p>
<p>We generate aggressively, far more concepts than we&#39;ll keep, to maximize creative diversity. A rating-and-pruning pass then selects the strongest, most distinctive ones to actually render.</p>
<p>The split between generation and selection is deliberate. Asking a model to be both wildly creative and ruthlessly discerning in a single pass produces safe, average output. Splitting it lets generation run uninhibited and selection run cold.</p>
<h2 id="stage-4--create-designs">Stage 4 — Create Designs</h2>
<p>Render the selected briefs into transparent artwork — ready to drop onto a product.</p>
<p>The hard part isn&#39;t the image generation itself — it&#39;s getting clean transparent cutouts reliably, on every design, at scale. That&#39;s its own <a href="/blog/clean-cutouts-from-generative-models">post</a>.</p>
<p>A rating pass scores the rendered designs. Anything the model bungled — visible artifacts, broken composition, off-brand colors — gets filtered before it ever reaches a product.</p>
<h2 id="stage-5--generate-mockups">Stage 5 — Generate Mockups</h2>
<p>Send each design to <a href="https://www.printful.com">Printful</a> to be rendered on real products — t-shirts, posters, mugs, phone cases — in scenes chosen to match the channel&#39;s vibe. A visual quality gate scores how well each design sits on its product. Mockups that don&#39;t pass are kept for review but excluded from the storefront.</p>
<h2 id="stage-6--finalize-listings">Stage 6 — Finalize Listings</h2>
<p>Pick the best design per product line, write SEO-ready listing copy, and produce drafts for the storefront. Publishing happens later — admin curation and the creator&#39;s own selection through the portal control what actually goes live.</p>
<h2 id="how-the-stages-fit-together">How the stages fit together</h2>
<p>Stages communicate through a shared database, not through pipeline outputs. Each stage persists everything it produces; the next stage loads what it needs. That decoupling is what makes any stage re-runnable in isolation, and what lets a failed stage stop a run cleanly without taking the rest of the pipeline with it.</p>
<p>The orchestration itself — how the stages are wired together, how concurrent pipeline runs share rate-limited APIs, how an LLM agent can author and operate the whole thing — is its own <a href="/blog/agentic-kestra">post</a>.</p>
<p>A failed run is cheap. A silently-degraded run produces wrong artifacts. The whole codebase leans hard into the first.</p>
]]></content:encoded>
            <category>pipeline</category>
            <category>architecture</category>
            <category>llm</category>
            <category>print-on-demand</category>
        </item>
    </channel>
</rss>