Designing Control Into AI Video Creation

I restructured a one-shot text-to-video tool into a four-stage creative workspace for Gen-2 (basics, outline, script, visuals), with checkpoints and version history so creators iterate instead of re-generating.

6 moderated usability tests · 6/6 users struggled with visual editing · One-shot generation → staged creative control

Product: VP Genie
Role: Design · Information Architecture · Usability Test
Timeline: January – April 2025

Before

One pass, no checkpoints

Script, visuals, and shots all generate from a single prompt

Prompt → Script, Visuals, Shots → Output (no checkpoints)

After

Staged, with checkpoints

Each stage is reviewed and approved before the next begins

01 Basics: tone, goal, audience

02 Outline: structure, beats

03 Script: dialogue, narration

04 Visuals: shots, scenes
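
The checkpoint gating above can be sketched in a few lines. This is a minimal illustration, not the product's code: the `StagedProject` class, its method names, and the rule that regenerating a stage clears downstream approvals are all assumptions made for the example. Only the stage order comes from the case study.

```python
from dataclasses import dataclass, field

# Stage order taken from the case study; everything else here is hypothetical.
STAGES = ["basics", "outline", "script", "visuals"]


@dataclass
class StagedProject:
    """Tracks drafts and approvals for each stage of the pipeline."""

    drafts: dict = field(default_factory=dict)
    approved: set = field(default_factory=set)

    def can_generate(self, stage: str) -> bool:
        # A stage unlocks only when every earlier stage has been approved.
        index = STAGES.index(stage)
        return all(earlier in self.approved for earlier in STAGES[:index])

    def generate(self, stage: str, draft: str) -> None:
        if not self.can_generate(stage):
            raise PermissionError(f"approve earlier stages before generating {stage!r}")
        self.drafts[stage] = draft
        # Assumption: a new draft invalidates this stage's approval
        # and the approval of every stage after it.
        for later in STAGES[STAGES.index(stage):]:
            self.approved.discard(later)

    def approve(self, stage: str) -> None:
        if stage not in self.drafts:
            raise ValueError(f"nothing to approve for {stage!r}")
        if not self.can_generate(stage):
            raise PermissionError(f"earlier stages must be approved before {stage!r}")
        self.approved.add(stage)
```

With this shape, visuals cannot run until the script is approved, and changing the outline later forces the script to be reviewed again before visuals unlock.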

The problem

“I don't think I'm smart enough for this tool.”

P3 clicked regenerate 12 times. Each time the AI returned something different, never closer. After the twelfth attempt, she stopped.

She is. Studio Engine.ai collapses professional pre-production — script, characters, props, storyboard — into one prompt. The power was real. The mental model assumed expertise most users didn't have.

HMW

How might Studio Engine.ai Gen-2 serve both professionals and emerging creators — and convert free users to paid?

Research

AI was treated as a one-shot oracle.

We tested three workflow phases. Task 2 is where the experience broke.

Task 01 · Script

Script generation — manageable

5 / 6 completed

Where it broke

Task 02 · Visual editing

Visual editing broke for everyone

2 / 6 completed character edit

3 / 6 completed location edit

AI output unpredictable — 6/6 clicked regenerate repeatedly with no convergence; had no way to communicate what was wrong
No recovery path — 3/6 lost work permanently when regenerating; there was no undo
Inpainting invisible — users found the button but had no mental model for what it would affect or how to use it
4-screen editing path — changing a character's hair required: project overview → character list → character editor → inpainting tool. 6/6 lost context mid-flow

Task 03 · Storyboard

Storyboard — friction but functional

4 – 5 / 6 completed

AI-native principle

Designing control around AI uncertainty.

What makes text-to-video UX different

Traditional tools: cursor touches output. Intent equals result.

AI tools: system interprets, generates, surprises. The gap between intent and output is where trust breaks.

New control patterns needed

Compare options: Variations, not verdicts
Recover history: Every generation reversible
Stage generation: Review before next phase
Show progress: Pipeline legible in real time
Explain AI actions: Surface what the model did

Design framework

From One-Shot Generation to Staged Creative Control

The five control patterns above need somewhere to live. I mapped them onto a staged creative workflow so each pattern has a home in the product, not just in the model.

Two layers, one system. The 4-stage pipeline shown at the top — Basics → Outline → Script → Visuals — is the AI generation flow. The 5-area IA below wraps around it: an entry point (Input), the pipeline itself (Basics, Visuals), and post-generation work (Edit, Manage).

Input

“What am I trying to make?”

Capture creative intent before asking for detailed production choices.

Basics

“Is the story direction right?”

Let users review the story foundation before visual generation begins.

Visuals

“Which assets match my intent?”

Turn AI output into selectable options, not a single verdict.

Edit

“How do I refine without losing context?”

Keep preview, controls, references, inpainting, and version history in one workspace.

Manage

“Where does this project live next?”

Give users a clear place to organize, export, and continue projects.

Decision 01

Reframe AI: from oracle to collaborator

6/6 participants encountered AI outputs they couldn't steer — the study's highest-severity finding. The product gave one output and waited for acceptance. If it was wrong, the only option was to regenerate and hope.

Gen-2 makes two structural changes. Visual generation is gated behind a script checkpoint: users review and commit the script before any images run, so a bad prompt doesn't cascade into dozens of wrong assets. When visuals do generate, the interface returns three variations at once: pick the closest match, regenerate that specific option, or save it to history. The AI shifts from decision-maker to collaborator.


Before: one output per generation — accept or restart.


After: 3 variations at once — pick, iterate, or save to history.
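
The pick-or-regenerate behavior can be sketched as follows. This is an illustration under stated assumptions: `VariationSet`, `generate_variations`, `regenerate_option`, and the `fake_generate` stand-in for the model call are hypothetical names, and the seed parameter is only a convenient way to make the example deterministic.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signature for a model call: (prompt, seed) -> asset identifier.
Generator = Callable[[str, int], str]


@dataclass(frozen=True)
class VariationSet:
    prompt: str
    options: List[str]


def generate_variations(prompt: str, generate: Generator, count: int = 3) -> VariationSet:
    # Return several candidates at once instead of a single verdict.
    return VariationSet(prompt, [generate(prompt, seed) for seed in range(count)])


def regenerate_option(current: VariationSet, index: int, generate: Generator, seed: int) -> VariationSet:
    # Replace only the chosen slot; the other options stay exactly as they were.
    options = list(current.options)
    options[index] = generate(current.prompt, seed)
    return VariationSet(current.prompt, options)


def fake_generate(prompt: str, seed: int) -> str:
    # Stand-in for the real model so the sketch runs without any service.
    return f"{prompt}#v{seed}"
```

Regenerating one slot returns a new set and leaves the previous one untouched, which is what lets the earlier options be saved to history rather than overwritten.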

Decision 02

Make generation reversible

3/6 participants lost work to regeneration with no undo. A generation history panel saves every output — return, compare, or recover at any point.


Before: regeneration overwrote previous work with no way back.

After: every generation is saved — return, compare, or recover at any point.
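
One way to model that guarantee is an append-only history with a movable "current" pointer. The sketch below is an assumption about the data structure, not a description of how the panel was built; `GenerationHistory` and its methods are hypothetical names.

```python
from typing import List, Tuple


class GenerationHistory:
    """Append-only history: regenerating adds a version, it never overwrites one."""

    def __init__(self) -> None:
        self._versions: List[str] = []
        self._current: int = -1  # index of the version shown in the editor

    def record(self, output: str) -> int:
        # Store the new output and make it the current version.
        self._versions.append(output)
        self._current = len(self._versions) - 1
        return self._current

    def current(self) -> str:
        if self._current < 0:
            raise LookupError("no generations recorded yet")
        return self._versions[self._current]

    def restore(self, version: int) -> str:
        # Move the pointer back without deleting anything recorded later.
        if not 0 <= version < len(self._versions):
            raise IndexError(f"unknown version {version}")
        self._current = version
        return self._versions[version]

    def compare(self, a: int, b: int) -> Tuple[str, str]:
        return self._versions[a], self._versions[b]

    def __len__(self) -> int:
        return len(self._versions)
```

Because `restore` only moves a pointer, going back to an earlier version never costs the user the later ones.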

Decision 03

Bring editing onto one surface

Character editing required a 4-screen path (project overview → character list → character editor → inpainting tool). Only 2/6 completed it, the study's lowest success rate. A consolidated panel puts generation, inpainting, and history on one surface.


Before: visual editing was split across multiple pages.

After: generation, inpainting, references, and history live in one workspace.

Outcomes

What shipped, and what's still open

The framework was adopted as the direction for Gen-2: a script-locked checkpoint before any visuals run, multi-option visual output, a generation history panel, and a consolidated editing workspace.

We did not get to A/B the redesign in production before I left the engagement, so I can't claim a conversion or retention number. What I would track in the next round — and what success would look like — is below.

Visual editing completion

Baseline · 33–50% (study)

Target · ≥ 80%

T3 / T4 in the usability study — the lowest-success tasks.
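
The baseline range follows directly from the study counts (2 of 6 and 3 of 6). The short sketch below reproduces that arithmetic and shows what the 80% target would mean for a study of the same size; the function names are hypothetical.

```python
import math


def completion_rate(completed: int, participants: int) -> float:
    return completed / participants


def completions_needed(target_percent: int, participants: int) -> int:
    # Smallest number of completions that reaches the target percentage.
    return math.ceil(target_percent * participants / 100)


# Counts reported in the usability study.
baseline = {
    "character edit": completion_rate(2, 6),
    "location edit": completion_rate(3, 6),
}
```

With six participants, reaching 80% or more means at least five of them complete the task.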

Regenerations per asset

Baseline · Not tracked

Target · Trending down over a session

A proxy for convergence — users getting closer, not just trying again.
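
"Trending down" needs an operational definition before it can be tracked. One option, offered here as an assumption rather than an agreed metric, is the least-squares slope of regeneration counts across the assets a user edits in one session, in order. `regeneration_trend` is a hypothetical helper.

```python
from typing import Sequence


def regeneration_trend(counts: Sequence[int]) -> float:
    """Least-squares slope of regenerations per asset, in session order.

    A negative value means later assets needed fewer regenerations.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two assets to measure a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    covariance = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(counts))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance
```

A session with counts like 5, 4, 2, 1 yields a negative slope, while 2, 3, 5 yields a positive one and would flag a user who is trying again without getting closer.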

History panel adoption

Baseline · n/a (new surface)

Target · ≥ 60% of users use it within first 3 sessions

Tests whether reversibility is felt, not just available.
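
A sketch of how that adoption figure could be computed from session logs, assuming each user's log is reduced to one boolean per session (whether the history panel was used). The log shape and the `history_adoption` name are assumptions for illustration.

```python
from typing import Dict, List


def history_adoption(sessions_by_user: Dict[str, List[bool]], window: int = 3) -> float:
    """Share of users who used the history panel in any of their first `window` sessions."""
    if not sessions_by_user:
        raise ValueError("no users to measure")
    adopters = sum(1 for sessions in sessions_by_user.values() if any(sessions[:window]))
    return adopters / len(sessions_by_user)


# Illustrative data only, not study results.
example_logs = {
    "u1": [False, True],
    "u2": [False, False, False, True],  # first use in session 4 does not count
    "u3": [True],
    "u4": [False, False, True],
    "u5": [],
}
```

Users who first open the panel after the third session are deliberately excluded, since the target is about early discovery.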

Session → export ratio

Baseline · Not tracked

Target · Improves vs. Gen-1

End-to-end signal that the staged pipeline produces finished work, not just drafts.

Design takeaways the research validated

Options, not verdicts

AI should return choices, not one final answer.

History, not overwrite

Every generation should be recoverable.

One workspace

Editing should stay in context.

Progress, not waiting

Generation should feel active, not frozen.

Reflection

What I would do differently — and push next

The framework holds, but three things stand out when I look back at the study and the redesign.

What I underestimated

How much of the trust gap was language. Terms like “Inpainting” and “Storyboard” meant different things to different participants. Naming would have been worth another round of research time.

What I’d test first

Gen-2 prototypes with 6–8 new participants, focused on the visual-editing flow (where character-edit completion was only 33%). Then a 4–6 week longitudinal study, moving from “can I complete this?” to “does this tool grow with me?”

What I’d push next: Ask Genie

Hover tooltips were a temporary scaffold, not a teaching system. The next step is an agent layer that watches user actions, explains tools like Inpainting in the moment, and proactively suggests the right control pattern — embedded in the editor, not a chat box beside it.